JENKINS-58656
Wrapper process leaves zombie when no init process present

      Description

      The merge of PR-98 moved the wrapper process to the background so that the launching process can exit quickly. However, that very act orphans the wrapper process. This is only a problem in environments where there is no init process to reap it (e.g. Docker containers run without the --init flag).
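      A quick way to see the difference outside the test suite (a sketch only; the image and command names are placeholders): Docker's --init flag injects tini as PID 1, which reaps orphaned children, so the same workload leaves a zombie wrapper without the flag and none with it.

          # Sketch: run the same workload with and without an init as PID 1.
          # my-build-image / my-build-command are hypothetical names, and the
          # command is assumed to keep the container alive long enough to inspect.
          docker run -d --name no-init          my-build-image my-build-command
          docker run -d --name with-init --init my-build-image my-build-command

          # Scan each container for zombie ("Z" state) processes:
          docker exec no-init   ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # orphaned wrapper shows up here
          docker exec with-init ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # tini reaps it; expect no output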

      Unit tests did not discover this bug because of a race between when the last ps was called and when the wrapper process exited. If another ps is run after the test detects that the script has finished running, the zombie state of the wrapper process is revealed.
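      Roughly what that re-check looks like (a sketch, not the actual test code; the result-file name is an assumption):

          # Wait for the durable script to signal completion via its result file
          # (hypothetical name), the point at which the existing test stops looking:
          while [ ! -f jenkins-result.txt ]; do sleep 1; done
          ps -eo pid,ppid,stat,comm                      # wrapper may still look alive here
          sleep 2
          ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'    # a second ps reveals it as defunct ("Z")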

      I'm not sure how much of an issue this really is, as there are numerous solutions for enabling zombie reaping in containers, but since there is an explicit check for zombies in the unit tests, it seemed worth mentioning.

            Activity

            jglick Jesse Glick added a comment -

            This only happens within the container itself, so once the container goes away, so do the zombies.

            If true, then there is no problem; most pods run only one sh step, and it is rare for more than a handful to be run. My concern is whether after running a bunch of builds in a cluster, after all the pods are gone there are still zombie processes left over on the nodes, which would eventually cause the node to crash and need to be rebooted. That is why I asked specifically about

            get a root shell into the raw node and scan for zombie processes

            I just learned of shareProcessNamespace: true today. We could enable it automatically in agent pods if that is what we have to do, but first I would like to understand the current behavior.
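            What enabling it would look like on a hand-written pod (a sketch only, not what the kubernetes plugin generates; the file name, pod name, and images are placeholders): with shareProcessNamespace: true, the pod's pause container runs as PID 1 for all containers and reaps orphaned zombie processes.

                # Sketch: write a minimal agent pod spec with a shared PID namespace
                # and apply it. Everything here is a placeholder except the
                # shareProcessNamespace field itself.
                printf '%s\n' \
                  'apiVersion: v1' \
                  'kind: Pod' \
                  'metadata:' \
                  '  name: jenkins-agent-example' \
                  'spec:' \
                  '  shareProcessNamespace: true' \
                  '  containers:' \
                  '  - name: jnlp' \
                  '    image: jenkins/inbound-agent' \
                  '  - name: build' \
                  '    image: my-build-image' \
                  '    command: ["sleep", "infinity"]' \
                  > agent-pod.yaml
                kubectl apply -f agent-pod.yaml
                # PID 1 inside either container should now be the pause process,
                # and a zombie scan should come back empty:
                kubectl exec jenkins-agent-example -c build -- ps -eo pid,stat,comm | awk '$2 ~ /^Z/'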

            stephankirsten Stephan Kirsten added a comment - edited

            We also ran into this problem. We have long-running builds based on hundreds of Makefiles, each of which invokes a bash for every command, and those bashes are never reaped. After some time we exhausted the PIDs and received the following message:

            fork: Resource temporarily unavailable

            The "shareProcessNamespace: true" solves this situation for us. But we think this should be documented in someway, because the behavior definitely changed.

            jglick Jesse Glick added a comment -

            Stephan Kirsten, can you elaborate a bit here? Are you using the kubernetes plugin? How many sh steps per pod? What kind of K8s installation?

            I do not want to have to document this; I want things to work out of the box. If adding this option to pod definitions reliably fixes what is otherwise a reproducible leak, and does not cause ill effects, then we can automate that in the kubernetes plugin. I see that there is a feature gate for this which might be turned off, and we would need to check what happens if the option is specified but the feature is disabled.
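            One way to check that last point (a sketch; it reuses the hypothetical agent-pod.yaml and pod name from the earlier sketch): the API server typically drops pod fields whose feature gate is disabled, so after applying the spec you can read the field back.

                # If the PodShareProcessNamespace feature gate is off, the field is
                # stripped on create; non-empty output means it was accepted.
                kubectl apply -f agent-pod.yaml
                kubectl get pod jenkins-agent-example -o jsonpath='{.spec.shareProcessNamespace}'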

            stephankirsten Stephan Kirsten added a comment -

            We use the kubernetes plugin with Kubernetes 1.15.3 on premises. Regarding sh steps per pod, we have only around 10, but they invoke our build system via shell scripts, which then work through Makefiles and spawn a bash for every Makefile step. That adds up to around 27k defunct bash processes that never get reaped, and eventually we run into the error I mentioned above.

            jglick Jesse Glick added a comment -

            So, reading between the lines, repeatedly running a build with one sh step whose script launches a thousand subprocesses should eventually result in an error. That is something that can be tested and, if true, worked around.
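            A container-level version of that test (a sketch; the counts and the assumption that PID 1 does no reaping are illustrative): launch many short-lived subprocesses whose direct parent exits first, then count what is left in zombie state.

                # Run inside a container whose PID 1 is not an init (no --init,
                # no shared pod PID namespace). Each inner bash is backgrounded and
                # its parent subshell exits immediately, so the bash is reparented
                # to PID 1; if PID 1 never wait()s on it, it lingers as a zombie.
                for i in $(seq 1000); do
                    ( bash -c 'true' & )
                done
                sleep 5
                # Count zombies; with a working reaper this stays near zero:
                ps -eo stat | grep -c '^Z'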


              People

              • Assignee: Unassigned
              • Reporter: carroll Carroll Chiou
              • Votes: 0
              • Watchers: 7