Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-28182

Revisit use of $JENKINS_SERVER_COOKIE and Launcher.kill

    XMLWordPrintable

    Details

    • Similar Issues:

      Description

      PlaceholderExecutable needs an environment variable to add to the context to ensure that Launcher.kill, called when the block exits, will clean up any stray external processes run on this node.

      Originally the name of this variable was chosen to be JENKINS_SERVER_COOKIE, which is set by Job (to a confidential hex key) and normally used by AbstractBuild for this purpose. Unlike in that case, for Workflow the value is overridden (to a random UUID) for each node block, since there may be several. I think the intent was to use the same variable name to suppress any unwanted kills coming from Jenkins core code, but there seem to be none. Anyway after the many refactorings of environment variable handling in Workflow, it turns out this is dead code!

      • FileMonitoringTask sets its own value for JENKINS_SERVER_COOKIE, to durable-<workspaceHash>, which overrides any other value, and is used by stop. So within a shell step, $JENKINS_SERVER_COOKIE is the one from durable-task, not PlaceholderExecutable.
      • env.JENKINS_SERVER_COOKIE picks up the value from Job, due to the precedence order in getEffectiveEnvironment.

      So it seems that while the intention is still good (we would like a final check that any processes started inside node are killed when the block exits), the implementation is not right. Probably PlaceholderExecutable just needs to pick an unrelated environment variable for this purpose. And this logic needs to be tested; for example:

      def file = new File(...);
      node {
        sh "(sleep 5; touch $file) &"
      }
      sleep 10
      assert !file.exists()
      

        Attachments

          Issue Links

            Activity

            Hide
            jglick Jesse Glick added a comment -

            JENKINS-28131 also discusses passing some other environment variables from PlaceholderExecutable, so would be easy to do at the same time.

            Show
            jglick Jesse Glick added a comment - JENKINS-28131 also discusses passing some other environment variables from PlaceholderExecutable , so would be easy to do at the same time.
            Hide
            jglick Jesse Glick added a comment -

            The use of JENKINS-25938 in the lts-609 branch also moves the Launcher.kill to the finish method (where it belonged to begin with), and moves the definition of COOKIE_VAR, so there could be merge conflicts if done before that branch is merged.

            Show
            jglick Jesse Glick added a comment - The use of JENKINS-25938 in the lts-609 branch also moves the Launcher.kill to the finish method (where it belonged to begin with), and moves the definition of COOKIE_VAR , so there could be merge conflicts if done before that branch is merged.
            Hide
            mmitche Matthew Mitchell added a comment -

            This is a big issue in some deployments (windows especially) that launch VC++ and C# compiler processes, since these sometimes want to leave processes around after execution.  This can cause failures down the road.  I'm looking into this issue now based on the info above

            Show
            mmitche Matthew Mitchell added a comment - This is a big issue in some deployments (windows especially) that launch VC++ and C# compiler processes, since these sometimes want to leave processes around after execution.  This can cause failures down the road.  I'm looking into this issue now based on the info above
            Hide
            jglick Jesse Glick added a comment -

            Matthew Mitchell I think your comment is misplaced. Fixing this architectural / code style issue should not have any user-visible behavior.

            Probably what you are asking for is accomplished simply by unsetting the environment variable for selected processes spawned by your script.

            Show
            jglick Jesse Glick added a comment - Matthew Mitchell I think your comment is misplaced. Fixing this architectural / code style issue should not have any user-visible behavior. Probably what you are asking for is accomplished simply by unsetting the environment variable for selected processes spawned by your script.
            Hide
            mmitche Matthew Mitchell added a comment -

            Actually it does have user visible behavior.  I did a bunch of experimentation and instrumentation of Jenkins.

            The process killer is looking for the original value of JENKINS_SERVER_COOKE (some hash). But, all processes within the node get the durable-<hash>, including the problematic ones I'm seeing. I understand the intention here.  Multiple node {} blocks running concurrently on the same node for the same pipeline job would be problematic if you got to the end of the node block and it killed the other node blocks running at the same time.

            However, what this does mean is that nothing is actually cleaned up except what is normally killed by parent processes exiting.  On Windows, this is more problematic because the ease of breaking parent->child relationships

            Show
            mmitche Matthew Mitchell added a comment - Actually it does have user visible behavior.  I did a bunch of experimentation and instrumentation of Jenkins. The process killer is looking for the original value of JENKINS_SERVER_COOKE (some hash). But, all processes within the node get the durable-<hash>, including the problematic ones I'm seeing. I understand the intention here.  Multiple node {} blocks running concurrently on the same node for the same pipeline job would be problematic if you got to the end of the node block and it killed the other node blocks running at the same time. However, what this does mean is that nothing is actually cleaned up except what is normally killed by parent processes exiting.  On Windows, this is more problematic because the ease of breaking parent->child relationships
            Hide
            jglick Jesse Glick added a comment -

            IOW that the killer logic is just broken. I would need to spend some time on it in a debugger to confirm.

            Show
            jglick Jesse Glick added a comment - IOW that the killer logic is just broken. I would need to spend some time on it in a debugger to confirm.
            Hide
            mmitche Matthew Mitchell added a comment -

            I can take this over the next day or two (it's necessary for us to roll this out).  It might just mean building more logic about what the process env var match needs to be when dealing with pipeline jobs.

            Show
            mmitche Matthew Mitchell added a comment - I can take this over the next day or two (it's necessary for us to roll this out).  It might just mean building more logic about what the process env var match needs to be when dealing with pipeline jobs.
            Hide
            jglick Jesse Glick added a comment -

            I suggested the probable fix in the issue description, along with a sketch of a test.

            Show
            jglick Jesse Glick added a comment - I suggested the probable fix in the issue description, along with a sketch of a test.
            Hide
            mmitche Matthew Mitchell added a comment -

            Jesse Glick your hunch was correct.  Fixed in https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/39 and added test case.

            Show
            mmitche Matthew Mitchell added a comment - Jesse Glick your hunch was correct.  Fixed in https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/39  and added test case.
            Hide
            mmitche Matthew Mitchell added a comment -

            Also, as an aside, the original motivation for this was to fix our workspace cleanup call at the end of node steps.  We were having processes left around.  However, after fixing this bug, it's clear that this won't fix that issue, since we actually need to have the cleanup happen after killing child processes.  Is there a hook (through plugin or otherwise) that can ensure that the WsCleanup runs after process cleanup?

            Show
            mmitche Matthew Mitchell added a comment - Also, as an aside, the original motivation for this was to fix our workspace cleanup call at the end of node steps.  We were having processes left around.  However, after fixing this bug, it's clear that this won't fix that issue, since we actually need to have the cleanup happen after killing child processes.  Is there a hook (through plugin or otherwise) that can ensure that the WsCleanup runs after process cleanup?
            Hide
            jglick Jesse Glick added a comment -

            Well the process cleanup should happen at the end of the sh/bat step, in addition to or instead of at the end of the node block.

            Show
            jglick Jesse Glick added a comment - Well the process cleanup should happen at the end of the sh / bat step, in addition to or instead of at the end of the node block.
            Hide
            mmitche Matthew Mitchell added a comment -

            Interesting.  I don't think I was seeing it at the end of an sh/bat step.  Let me check that.  But you're right, this should fix the overall issue if it was killed at the end of each sh/bat

            Show
            mmitche Matthew Mitchell added a comment - Interesting.  I don't think I was seeing it at the end of an sh/bat step.  Let me check that.  But you're right, this should fix the overall issue if it was killed at the end of each sh/bat
            Hide
            mmitche Matthew Mitchell added a comment -

            Jesse Glick The kill here is in ExecutorStepExecution, which is scoped to the node level, not the step level, so the process killing happens at the end of the node block.  There is no kill for DurableStep currently.  My take on this is that this would be the desired behavior, but I'm open to changing it.  This would certainly solve the problem, but might have unintended consequences for some user's pipeline workflows.

            Show
            mmitche Matthew Mitchell added a comment - Jesse Glick  The kill here is in ExecutorStepExecution, which is scoped to the node level, not the step level, so the process killing happens at the end of the node block.  There is no kill for DurableStep currently.  My take on this is that this would be the desired behavior, but I'm open to changing it.  This would certainly solve the problem, but might have unintended consequences for some user's pipeline workflows.
            Hide
            jglick Jesse Glick added a comment -

            Hmm, probably FileMonitoringController.cleanup should be doing this.

            Show
            jglick Jesse Glick added a comment - Hmm, probably FileMonitoringController.cleanup should be doing this.
            Hide
            relm Reagan Elm added a comment -

            Matthew Mitchell is there any way to opt-out of Launcher.kill for a specific node? Some of our jobs intentionally start log running scripts we need to survive after the node has completed.

            Show
            relm Reagan Elm added a comment - Matthew Mitchell is there any way to opt-out of Launcher.kill for a specific node? Some of our jobs intentionally start log running scripts we need to survive after the node has completed.
            Hide
            jglick Jesse Glick added a comment -

            Clear the environment I suppose.

            Show
            jglick Jesse Glick added a comment - Clear the environment I suppose.
            Hide
            mmitche Matthew Mitchell added a comment -

            The documented way was always to alter the environment variable that is being checked, so for instance, in Freestyle jobs you'd do:

            JENKINS_SERVER_COOKIE=do_not_kill

            in the pipeline jobs you'd do:

            JENKINS_NODE_COOKIE=do_not_kill

            Show
            mmitche Matthew Mitchell added a comment - The documented way was always to alter the environment variable that is being checked, so for instance, in Freestyle jobs you'd do: JENKINS_SERVER_COOKIE=do_not_kill in the pipeline jobs you'd do: JENKINS_NODE_COOKIE=do_not_kill
            Hide
            mykola Mykola Marzhan added a comment -

            Hi all,

            sorry for reopening, I just received new jenkins update...

            Is it possible to don't reinvent variable names, but use the stable well-known approach from usual jobs?
            I am about BUILD_ID=dontKillMe variable.
            documented here https://wiki.jenkins.io/display/JENKINS/ProcessTreeKiller 

            everyone who uses jenkins a lot – know about this variable and it is extremely simple to receive info about this variable and process killing via searching.
            it is very strange and unusual to have different variables for old-school jobs and for pipelines.

             

             

            Show
            mykola Mykola Marzhan added a comment - Hi all, sorry for reopening, I just received new jenkins update... Is it possible to don't reinvent variable names, but use the stable well-known approach from usual jobs? I am about BUILD_ID=dontKillMe variable. documented here  https://wiki.jenkins.io/display/JENKINS/ProcessTreeKiller   everyone who uses jenkins a lot – know about this variable and it is extremely simple to receive info about this variable and process killing via searching. it is very strange and unusual to have different variables for old-school jobs and for pipelines.    
            Hide
            mmitche Matthew Mitchell added a comment -

            I don't think BUILD_ID was being used for killing processes before.

            Anyways, the classic approach doesn't work with pipelines.  Two executors on the same machine could run parts of the same build in parallel.  This means that when the process killer attempts to kill BUILD_ID it will kill the other executor's processes.

            We could perhaps introduce an additional environment variable, and check for both.  Thoughts Jesse Glick?

            Show
            mmitche Matthew Mitchell added a comment - I don't think BUILD_ID was being used for killing processes before. Anyways, the classic approach doesn't work with pipelines.  Two executors on the same machine could run parts of the same build in parallel.  This means that when the process killer attempts to kill BUILD_ID it will kill the other executor's processes. We could perhaps introduce an additional environment variable, and check for both.  Thoughts Jesse Glick ?
            Hide
            jglick Jesse Glick added a comment -

            Does not need to be reopened.

            I see no reason to look for BUILD_ID. As pointed out, this does not generalize well to parallel blocks.

            Show
            jglick Jesse Glick added a comment - Does not need to be reopened. I see no reason to look for BUILD_ID . As pointed out, this does not generalize well to parallel blocks.
            Hide
            mykola Mykola Marzhan added a comment -

            Hi Jesse Glick
            it is not needed to look only to BUILD_ID but it is possible to look also to BUILD_ID.
            it means kill process only in case if JENKINS_NODE_COOKIE and BUILD_ID are unchanged.

            I know that pipelines are a completely new thing, but jenkins itself is not a new thing and it very important to keep backward compatibility.
            Many people work with jenkins many years and these new variables are completely unexpected. 

            Show
            mykola Mykola Marzhan added a comment - Hi  Jesse Glick it is not needed to look only to BUILD_ID but it is possible to look also to BUILD_ID . it means kill process only in case if JENKINS_NODE_COOKIE and BUILD_ID are unchanged. I know that pipelines are a completely new thing, but jenkins itself is not a new thing and it very important to keep backward compatibility. Many people work with jenkins many years and these new variables are completely unexpected. 
            Hide
            jglick Jesse Glick added a comment -

            Could be a follow-up RFE (linked issue).

            Show
            jglick Jesse Glick added a comment - Could be a follow-up RFE (linked issue).

              People

              • Assignee:
                mmitche Matthew Mitchell
                Reporter:
                jglick Jesse Glick
              • Votes:
                1 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: