JENKINS-36013

Automatically abort ExecutorPickle rehydration from an ephemeral node

    Details

    • Sprint:
      Pipeline - July/August

      Description

      ExecutorPickle.rehydrate ought to be able to detect that it has been spinning in circles because the agent node it was supposed to run on is not in the Jenkins node list, and automatically abort, causing the build to fail with a comprehensible message rather than just hanging indefinitely. (As opposed to being registered but offline, which is normal enough for a JNLP agent etc.—in such cases we just want to wait for the agent to come back online.)

      This would provide a better experience for the case of a build which was running on an EphemeralNode (such as from a Cloud without durable-task integration) when Jenkins was restarted. An agent using an inappropriate RetentionStrategy is trickier since it might still be defined after a restart, but will soon be terminated. Similarly, there may be cases where the agent is actually going to be redefined (with the same name) when it is attached after the restart—not sure about the Swarm plugin, but CloudBees DEV@cloud OPEs work this way. To prevent the build from being killed too aggressively, the cleanup should be delayed until some time has elapsed since rehydration began (or, ideally, since Jenkins completed initialization)—say, five minutes.
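The check proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the class `RehydrationTimeout`, the method `shouldAbort`, and the use of a plain set of node names in place of the Jenkins node list are all hypothetical. The five-minute value is the one suggested in the description.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Jenkins or any plugin. */
final class RehydrationTimeout {
    /** Grace period suggested in the description ("say, five minutes"). */
    static final Duration GRACE_PERIOD = Duration.ofMinutes(5);

    private RehydrationTimeout() {}

    /**
     * Returns true only when the node is absent from the set of registered
     * nodes AND the grace period has elapsed since rehydration began.
     * A node that is registered (even if offline) never triggers an abort.
     */
    static boolean shouldAbort(String nodeName,
                               Set<String> registeredNodes, // assumption: stands in for the Jenkins node list
                               Instant rehydrationStart,
                               Instant now) {
        if (registeredNodes.contains(nodeName)) {
            return false; // registered but possibly offline: keep waiting
        }
        Duration waited = Duration.between(rehydrationStart, now);
        return waited.compareTo(GRACE_PERIOD) >= 0;
    }
}
```

Measuring from the completion of Jenkins initialization rather than from the start of rehydration, as the description prefers, would only change which instant is passed as `rehydrationStart`.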

            Activity

            jglick Jesse Glick added a comment -

            Originally suggested in JENKINS-26130 but I felt it was better to split this out. JENKINS-26130 does at least provide a much more comprehensible diagnosis for the problem.

            hrmpw Patrick Wolf added a comment -

            Sounds related to what R. Tyler Croy is experiencing with JENKINS-41569.

            anomalizer Arvind Jayaprakash added a comment -

            FWIW, I am experiencing the same issue with the mesos cloud provider.

            michaelneale Michael Neale added a comment -

            Jesse Glick, did you have any WIP on this, or is someone free to take a run at it? It is biting a few people now (not mentioned on this ticket).

            jglick Jesse Glick added a comment -

            I had nothing in progress; up for grabs. The tricky bit is identifying the case you are running under reliably. If you are actually dealing with an EphemeralNode, probably the ExecutorPickle could record that fact and fail promptly upon rehydration. The harder cases are nodes that get reattached by some outside process sometime after start—you need to give them a grace period.
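The two cases distinguished in this comment could be sketched as below. Everything here is a hypothetical illustration: `RehydrationDecision`, `Outcome`, and the boolean `wasEphemeral` (standing in for a fact the pickle would record when it is saved) are invented names, not existing Jenkins or plugin API. Note that a later comment in this thread reports that failing promptly for ephemeral nodes broke agents from the Swarm plugin, so this shows the proposal as stated here rather than a recommended final design.

```java
import java.time.Duration;

/** Hypothetical decision logic for illustration; not actual plugin code. */
final class RehydrationDecision {
    enum Outcome { KEEP_WAITING, FAIL_BUILD }

    private RehydrationDecision() {}

    /**
     * @param wasEphemeral   assumption: recorded in the pickle when it was saved
     * @param nodeRegistered whether a node with the saved name currently exists
     * @param sinceStartup   time elapsed since Jenkins started
     * @param gracePeriod    how long to wait for externally reattached nodes
     */
    static Outcome decide(boolean wasEphemeral,
                          boolean nodeRegistered,
                          Duration sinceStartup,
                          Duration gracePeriod) {
        if (nodeRegistered) {
            return Outcome.KEEP_WAITING; // node exists, may simply be offline
        }
        if (wasEphemeral) {
            return Outcome.FAIL_BUILD; // easy case: fail promptly
        }
        // Harder case: the node may be reattached by an outside process.
        return sinceStartup.compareTo(gracePeriod) >= 0
                ? Outcome.FAIL_BUILD
                : Outcome.KEEP_WAITING;
    }
}
```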

            michaelneale Michael Neale added a comment -

            Thanks Jesse Glick. Yes, the same conclusion was reached on a hangout: have some grace period for things that happen outside. That should be fine.

            svanoort Sam Van Oort added a comment -

            I have a test case that reproduces the issue and a code fix that kills the build. Working on adding the timeout to that solution, and looking at ways to limit the scope of builds that are killed so that it does not get overeager.

            michaelneale Michael Neale added a comment -

            nice Sam Van Oort!

            svanoort Sam Van Oort added a comment -

            FYI, I've cut a beta, 2.14-beta, for people who want to try this out.

            svanoort Sam Van Oort added a comment -

            On second thought, hold on, I'm going to cut a different one. 

            svanoort Sam Van Oort added a comment -

            Fix released with durable task step 2.14

            basil Basil Crow added a comment - - edited

            After upgrading from 2.13 to 2.14, my pipeline jobs that use the swarm plugin can't survive a restart anymore.

            Under 2.13, my pipeline jobs were able to survive a restart if I used the -deleteExistingClients option that the swarm plugin has. Without -deleteExistingClients, I used to get this:

            SEVERE: RetryException occurred
            hudson.plugins.swarm.RetryException: Failed to create a slave on Jenkins, response code: 409
            A slave called 'create-dc-slave-1-eff65ede' already exists and legacy clients do not support name disambiguation
            
                    at hudson.plugins.swarm.SwarmClient.createSwarmSlave(SwarmClient.java:448)
                    at hudson.plugins.swarm.Client.run(Client.java:134)
                    at hudson.plugins.swarm.Client.main(Client.java:87)
            

            But this was basically working once I added -deleteExistingClients:

            Resuming build at Fri Aug 25 23:54:47 UTC 2017 after Jenkins restart
            Waiting to resume part of test #5: create-dc-slave-1-eff65ede is offline
            Waiting to resume part of test #5: There are no nodes with the label ‘create-dc-slave-1-eff65ede’
            Waiting to resume part of test #5: Jenkins doesn’t have label create-dc-slave-1-eff65ede
            create-dc-slave-1-eff65ede is offline
            Ready to run at Fri Aug 25 23:54:58 UTC 2017
            

            Now, after upgrading to 2.14, this doesn't work at all:

            Resuming build at Sat Aug 26 00:57:51 UTC 2017 after Jenkins restart
            [Pipeline] End of Pipeline
            ERROR: Killed hudson.model.Queue$WaitingItem:ExecutorStepExecution.PlaceholderTask{runId=test#6,label=create-dc-slave-1-eff65ede,context=CpsStepContext[3:null]:Owner[test/6:test #6],cookie=da72341a-9bff-4c9d-9607-8ff0c2e81595}:29 because EphemeralNode create-dc-slave-1-eff65ede is never going to reappear, by definition!
            Finished: FAILURE
            

            Since I rely on the swarm plugin in my pipeline jobs, I view this as a major regression. It would be nice to have this functionality restored.

            Also note that I'm using the latest version of the swarm plugin and swarm client (3.4 on both ends).

            svanoort Sam Van Oort added a comment - - edited

            Basil Crow I am sorry to hear that this caused a regression for you – it appears to be an unanticipated case where EphemeralNodes generated by the Swarm Plugin aren't really following the contract of that interface, since they can reconnect and be recreated but will not do so immediately. 

            I have a fix supplied here – https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/48 – pending review by Jesse Glick. This will apply a 5-minute timeout, restoring my original implementation strategy.

            jglick Jesse Glick added a comment -

            Basil Crow sounds like a bug in the Swarm plugin to me. The implementation of JENKINS-34593 ought to have removed the EphemeralNode marker if I understand it correctly. The description of -deleteExistingClients does not seem to talk about retaining agents across restarts but it seems that you have discovered that as a use case—probably not a tested one. (As an aside, it seems the plugin contains no tests which actually run Jenkins, much less tests of Pipeline interoperability or of restart behavior.)

            basil Basil Crow added a comment -

            Sam Van Oort and Jesse Glick, I wanted to say thank you for releasing version 2.15 which restores this functionality. I tested it, and my pipeline jobs that use the swarm plugin once again survive Jenkins restarts. Thanks!

            michaelneale Michael Neale added a comment -

            hi five Sam Van Oort (but make sure he has washed his hands, he just finished coming second in a chilli eating competition)!

            svanoort Sam Van Oort added a comment -

            Basil Crow Thanks!  I'm glad you're finding it works well for you now


              People

              • Assignee:
                svanoort Sam Van Oort
                Reporter:
                jglick Jesse Glick
              • Votes:
                6
                Watchers:
                11
