JENKINS-58656: Wrapper process leaves zombie when no init process present

      Description

      The merge of PR-98 moved the wrapper process to the background so that the launching process can exit quickly. However, that very act orphans the wrapper process. This is only a problem in environments where there is no init process (e.g. Docker containers run without the --init flag).
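
      For illustration, a minimal sketch of the failure mode described above (not taken from the plugin; the image and timings are arbitrary). PID 1 in the container never calls wait(), so an orphaned child stays defunct after it exits:

      ```
      # PID 1 in this container is "sleep infinity", which never reaps children.
      docker run -d --name zombie-demo ubuntu:22.04 sleep infinity
      # The exec'd shell backgrounds a short-lived child and exits immediately,
      # orphaning the child onto PID 1 (analogous to the backgrounded wrapper).
      docker exec zombie-demo sh -c 'sleep 1 &'
      sleep 3
      # The finished child shows up with STAT "Z" (defunct).
      docker exec zombie-demo ps -eo pid,ppid,stat,comm
      docker rm -f zombie-demo
      ```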

      Unit tests did not discover this bug because of a race between when the last ps was called and when the wrapper process exited. If another ps is run after the test detects that the script has finished running, the zombie state of the wrapper process is revealed.
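
      To make the race concrete, here is an illustrative sketch (not the actual test code; RESULT_FILE stands in for whatever the test polls to detect completion). A single ps taken the moment the script is reported finished can run before the wrapper itself has exited, so the zombie only appears on a later ps:

      ```
      # Illustrative only; RESULT_FILE is a placeholder for the test's completion marker.
      while [ ! -f "$RESULT_FILE" ]; do sleep 0.1; done   # script reported as finished
      ps -eo stat,comm | awk '$1 ~ /^Z/'                  # may be empty: wrapper has not exited yet
      sleep 1
      ps -eo stat,comm | awk '$1 ~ /^Z/'                  # now the defunct wrapper is revealed
      ```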

      I'm not sure how much of an issue this really is, since there are numerous solutions for enabling zombie reaping in containers, but as there is an explicit check for zombies in the unit tests, it seemed worth mentioning.
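
      One such solution, sketched against the same illustrative setup as above: Docker's --init flag injects a minimal init (tini) as PID 1, which reaps re-parented children:

      ```
      # Same scenario as before, but PID 1 is now docker-init (tini), which reaps orphans.
      docker run -d --init --name reaper-demo ubuntu:22.04 sleep infinity
      docker exec reaper-demo sh -c 'sleep 1 &'
      sleep 3
      docker exec reaper-demo ps -eo pid,ppid,stat,comm   # no "Z" entry this time
      docker rm -f reaper-demo
      ```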

            Activity

            Jesse Glick added a comment -

            Stephan Kirsten can you elaborate a bit here? You are using the kubernetes plugin? How many sh steps per pod? What kind of K8s installation?

            I do not want to have to document this; I want things to work out of the box. If adding this option to pod definitions reliably fixes what is otherwise a reproducible leak, and does not cause ill effects, then we can automate that in the kubernetes plugin. I see that there is a feature gate for this which might be turned off, and we would need to check what happens if the option is specified but the feature is disabled.
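
            For reference, a guess at what such a pod-level option might look like. The comment does not name the option; `shareProcessNamespace` is my assumption, and the pod spec below is purely illustrative, not YAML generated by the kubernetes plugin:

            ```
            # Assumption: the option under discussion is spec.shareProcessNamespace, which makes the
            # pod's pause container PID 1 for all containers so it can reap orphaned children.
            kubectl apply -f - <<'EOF'
            apiVersion: v1
            kind: Pod
            metadata:
              name: agent-pidns-demo            # hypothetical name
            spec:
              shareProcessNamespace: true
              containers:
              - name: jnlp
                image: jenkins/jnlp-slave       # image name taken from a later comment in this thread
                command: ["sleep", "infinity"]  # placeholder command for the demo
            EOF
            ```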

            Stephan Kirsten added a comment -

            We use the kubernetes plugin with Kubernetes 1.15.3 on premises. Regarding sh steps per pod, we have only around 10, but we invoke our build system via shell scripts, which then work through Makefiles and invoke bash for every step of the Makefiles. That adds up to around 27k defunct bash processes that are not getting reaped, and eventually we run into the error I mentioned above.
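
            (For reference, a count like that can be taken on a node with something along these lines; the exact command is mine, not the reporter's:)

            ```
            # Count defunct (zombie) processes per command name, e.g. "27000 bash"
            ps -eo stat,comm | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c
            ```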

            Jesse Glick added a comment -

            So reading between the lines, repeatedly running a build which has one sh step that runs a script that launches a thousand subprocesses should eventually result in an error. That is something that can be tested and, if true, worked around.

            Kenneth Rogers added a comment -

            I have been unable to reproduce the `Resource temporarily unavailable.` error when attempting to run pipelines that simulate the situation described.

            I created a cluster in GKE using the gcloud CLI: `gcloud container clusters create <cluster-name> --machine-type=n1-standard-2 --cluster-version=latest`. I installed CloudBees Core for Modern Platforms version 2.204.3.7 (the latest public release at the time I started testing) using Helm, used `kubectl get nodes` to find the names of the nodes, and used `gcloud beta compute ssh` to connect to the nodes via ssh. I then ran `watch 'ps fauxwww | fgrep Z'` to watch for zombie processes on each node.
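
            The same setup, collected as commands (cluster and node names are placeholders):

            ```
            gcloud container clusters create <cluster-name> --machine-type=n1-standard-2 --cluster-version=latest
            kubectl get nodes                     # find the node names
            gcloud beta compute ssh <node-name>   # connect to each node, then on the node:
            watch 'ps fauxwww | fgrep Z'          # watch for zombie processes
            ```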

            Using the Groovy `while (true) { sh 'sleep 1' }` I was able to produce zombie processes on the node the build agent was assigned to. The build ran for 5 hours 17 minutes before using up all the available process resources. After the processes were exhausted, the job exited with an error message that no processes were available. After the pod running the job exited, the zombie processes on the node were removed and the node continued to function.

            Using `while :; do /usr/bin/sleep .01; done` as a way to generate subprocesses, I tested it as the direct argument of an `sh` step in a pipeline, using both the `jenkins/jnlp-slave` and `cloudbees/cloudbees-core-agent` images. Neither produced any zombie processes on the worker nodes of the Kubernetes cluster. To introduce another layer of subprocesses, I also put that `while` line into a file and had the `sh` step execute that file, but it also did not produce any zombie processes on the worker nodes. Additionally, I made that while loop a step in a Makefile and executed it that way, which also did not produce any zombies on the nodes.
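
            A sketch of the extra indirection layer described (the file name is illustrative):

            ```
            # Same loop, one level of indirection deeper: the sh step runs a script file.
            printf 'while :; do /usr/bin/sleep .01; done\n' > spawn.sh
            sh spawn.sh &
            sleep 30
            ps -eo stat,comm | awk '$1 ~ /^Z/'   # in these tests this stayed empty (no zombies)
            kill $!
            ```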

            chetan shivashankar added a comment -

            I have been observing some issues on nodes. I am using Amazon EKS for the cluster, and once in a while a node ends up either with a soft lockup or flips to NodeNotReady. I have tried a lot of troubleshooting, but so far nothing concrete has been figured out. I was working with AWS support, and they told me there were a couple of other cases where they saw similar behavior with Jenkins pods. Another pattern I observed is that all the nodes which had issues had a very high number of zombie processes, at least 4000+. I still don't have conclusive evidence that the issue is due to zombie processes/Jenkins, but the patterns all indicate that there could be something in the k8s plugin for Jenkins that may be causing the issue.

            Did any of you face the same issue?


              People

              • Assignee: Unassigned
              • Reporter: Carroll Chiou
              • Votes: 1
              • Watchers: 11