Jenkins / JENKINS-27565

Nodes can be removed as idle before the assigned tasks have started

      Description

      A number of our customers, using different cloud providers, have observed quite a few different manifestations of this issue. What the reports have in common is the use of a "single-shot" style retention strategy, though with great care the root cause can be observed under any retention strategy other than Always.

      The basic issue is that you cannot determine whether a node is idle unless you hold the Queue lock, as that is the only way to ensure that the Queue is not in the process of assigning work to the node you are removing.
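
      The check-then-remove pattern described above can be sketched as follows. This is a toy model, not Jenkins code: the class `IdleNodeRemoval`, its methods, and the node names are all hypothetical, and a plain `ReentrantLock` stands in for the Queue lock. The point it illustrates is only that the idle check and the removal have to happen under the same lock that guards work assignment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the race. Hypothetical names throughout; not the Jenkins API. */
class IdleNodeRemoval {
    // Stands in for the Queue lock that guards assignment of work to nodes.
    private final ReentrantLock queueLock = new ReentrantLock();
    // Node name -> number of tasks assigned (started or about to start).
    private final Map<String, Integer> assigned = new HashMap<>();

    void addNode(String name) {
        queueLock.lock();
        try {
            assigned.put(name, 0);
        } finally {
            queueLock.unlock();
        }
    }

    /** Scheduler side: hand a task to a node, if the node still exists. */
    boolean assign(String name) {
        queueLock.lock();
        try {
            Integer count = assigned.get(name);
            if (count == null) {
                return false; // node already removed; the task stays queued
            }
            assigned.put(name, count + 1);
            return true;
        } finally {
            queueLock.unlock();
        }
    }

    /** Retention side: idle check and removal happen under the same lock. */
    boolean removeIfIdle(String name) {
        queueLock.lock();
        try {
            Integer count = assigned.get(name);
            if (count == null || count > 0) {
                return false; // unknown node, or work has been assigned to it
            }
            assigned.remove(name);
            return true;
        } finally {
            queueLock.unlock();
        }
    }

    boolean hasNode(String name) {
        queueLock.lock();
        try {
            return assigned.containsKey(name);
        } finally {
            queueLock.unlock();
        }
    }

    public static void main(String[] args) {
        IdleNodeRemoval pool = new IdleNodeRemoval();
        pool.addNode("agent-1");
        System.out.println("assigned: " + pool.assign("agent-1"));
        System.out.println("removed while busy: " + pool.removeIfIdle("agent-1"));
    }
}
```

      If `removeIfIdle` instead read the count without taking the lock, `assign` could run between the check and the removal, which is the window in which a task ends up on a node that no longer exists.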

      Symptoms include:

      • Build logs that claim the job was executed on "master" even though the job is tied to a specific label that master does not have. The build log will have been "unable to be determined"
      • Build logs where the node is gone just as soon as the job starts
        2015-03-05 13:27:55.101 Started by upstream project "____" build number ___ 
        2015-03-05 13:27:55.102 originally caused by: 
        2015-03-05 13:27:55.103 Started by user ____ 
        2015-03-05 13:27:55.437 FATAL: no longer a configured node for ____ 
        2015-03-05 13:27:55.440 java.lang.IllegalStateException: no longer a configured node for ____ 
        2015-03-05 13:27:55.440 at hudson.model.AbstractBuild$AbstractBuildExecution.getCurrentNode(AbstractBuild.java:452) 
        2015-03-05 13:27:55.440 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:484) 
        2015-03-05 13:27:55.441 at hudson.model.Run.execute(Run.java:1745) 
        2015-03-05 13:27:55.441 at hudson.model.Build.run(Build.java:113) 
        2015-03-05 13:27:55.441 at hudson.model.ResourceController.execute(ResourceController.java:89) 
        2015-03-05 13:27:55.441 at hudson.model.Executor.run(Executor.java:240)
        

            Activity

            stephenconnolly Stephen Connolly created issue -
            jglick Jesse Glick made changes -
            Field Original Value New Value
            Status Open [ 1 ] In Progress [ 3 ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "PR 1596 (Web Link)" [ 12179 ]
            jglick Jesse Glick made changes -
            Labels queue slave threads
            jglick Jesse Glick made changes -
            Link This issue is blocking JENKINS-20046 [ JENKINS-20046 ]
            pablaasmo Per Arnold Blaasmo made changes -
            Link This issue is blocking JENKINS-27476 [ JENKINS-27476 ]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/Functions.java
            core/src/main/java/hudson/model/AbstractCIBase.java
            core/src/main/java/hudson/model/Computer.java
            core/src/main/java/hudson/model/Executor.java
            core/src/main/java/hudson/model/Hudson.java
            core/src/main/java/hudson/model/Node.java
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/ResourceController.java
            core/src/main/java/hudson/slaves/AbstractCloudSlave.java
            core/src/main/java/hudson/slaves/ComputerRetentionWork.java
            core/src/main/java/hudson/slaves/NodeProvisioner.java
            core/src/main/java/hudson/slaves/RetentionStrategy.java
            core/src/main/java/hudson/slaves/SlaveComputer.java
            core/src/main/java/jenkins/model/Jenkins.java
            core/src/main/java/jenkins/model/Nodes.java
            core/src/main/java/jenkins/util/AtmostOneTaskExecutor.java
            core/src/main/resources/hudson/model/Messages.properties
            core/src/main/resources/lib/hudson/executors.jelly
            core/src/main/resources/lib/layout/layout.jelly
            http://jenkins-ci.org/commit/jenkins/92147c3597308bc05e6448ccc41409fcc7c05fd7
            Log:
            [FIXED JENKINS-27565] Refactor the Queue and Nodes to use a consistent locking strategy

            The test system I set up to verify resolution of customer(s)' issues driving this change, required
            additional changes in order to fully resolve the issues at hand. As a result I am bundling these
            changes:

            • Moves nodes to being store in separate config files outside of the main config file (improves performance) [FIXED JENKINS-27562]
            • Makes the Jenkins is loading screen not block on the extensions loading lock [FIXED JENKINS-27563]
            • Removes race condition rendering the list of executors [FIXED JENKINS-27564] [FIXED JENKINS-15355]
            • Tidy up the locks that were causing deadlocks with the once retention strategy in durable tasks [FIXED JENKINS-27476]
            • Remove any requirement from Jenkins Core to lock on the Queue when rendering the Jenkins UI [FIXED-JENKINS-27566]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/Functions.java
            core/src/main/java/hudson/model/AbstractCIBase.java
            core/src/main/java/hudson/model/Computer.java
            core/src/main/java/hudson/model/Executor.java
            core/src/main/java/hudson/model/Hudson.java
            core/src/main/java/hudson/model/Node.java
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/ResourceController.java
            core/src/main/java/hudson/slaves/AbstractCloudSlave.java
            core/src/main/java/hudson/slaves/CloudRetentionStrategy.java
            core/src/main/java/hudson/slaves/CloudSlaveRetentionStrategy.java
            core/src/main/java/hudson/slaves/ComputerRetentionWork.java
            core/src/main/java/hudson/slaves/NodeProvisioner.java
            core/src/main/java/hudson/slaves/RetentionStrategy.java
            core/src/main/java/hudson/slaves/SimpleScheduledRetentionStrategy.java
            core/src/main/java/hudson/slaves/SlaveComputer.java
            core/src/main/java/jenkins/model/Jenkins.java
            core/src/main/java/jenkins/model/Nodes.java
            core/src/main/java/jenkins/util/AtmostOneTaskExecutor.java
            core/src/main/resources/hudson/model/Messages.properties
            core/src/main/resources/lib/hudson/executors.jelly
            core/src/main/resources/lib/layout/layout.jelly
            test/src/test/groovy/hudson/model/AbstractProjectTest.groovy
            test/src/test/java/hudson/model/ExecutorTest.java
            test/src/test/java/hudson/model/GetEnvironmentOutsideBuildTest.java
            test/src/test/java/hudson/model/QueueTest.java
            test/src/test/java/jenkins/model/JenkinsReloadConfigurationTest.java
            http://jenkins-ci.org/commit/jenkins/ecac963eaff0608accf950d90d75cff8b66bdc4c
            Log:
            Merge pull request #1596 from stephenc/threadsafe-node-queue

            JENKINS-27565 Fix threading issues with Nodes and Queue

            Compare: https://github.com/jenkinsci/jenkins/compare/1c781526a644...ecac963eaff0

            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4033
            [FIXED JENKINS-27565] Refactor the Queue and Nodes to use a consistent locking strategy (Revision 92147c3597308bc05e6448ccc41409fcc7c05fd7)

            Result = UNSTABLE
            stephen connolly : 92147c3597308bc05e6448ccc41409fcc7c05fd7
            Files :

            • core/src/main/java/hudson/model/Executor.java
            • core/src/main/java/hudson/slaves/SlaveComputer.java
            • core/src/main/java/hudson/slaves/AbstractCloudSlave.java
            • core/src/main/java/hudson/slaves/RetentionStrategy.java
            • core/src/main/java/jenkins/util/AtmostOneTaskExecutor.java
            • core/src/main/java/hudson/model/Queue.java
            • core/src/main/resources/lib/hudson/executors.jelly
            • core/src/main/java/hudson/Functions.java
            • core/src/main/java/hudson/model/Node.java
            • core/src/main/java/hudson/model/ResourceController.java
            • core/src/main/java/hudson/model/AbstractCIBase.java
            • core/src/main/java/jenkins/model/Jenkins.java
            • core/src/main/resources/hudson/model/Messages.properties
            • core/src/main/java/hudson/model/Computer.java
            • core/src/main/java/hudson/slaves/ComputerRetentionWork.java
            • core/src/main/java/hudson/slaves/NodeProvisioner.java
            • core/src/main/java/jenkins/model/Nodes.java
            • core/src/main/resources/lib/layout/layout.jelly
            • core/src/main/java/hudson/model/Hudson.java
            jglick Jesse Glick added a comment -

            I think this can be closed as Fixed now, right?

            stephenconnolly Stephen Connolly made changes -
            Status In Progress [ 3 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            changelog.html
            http://jenkins-ci.org/commit/jenkins/46dc6850edb1d7ef52592794b15e69db7dfbed1a
            Log:
            Noting merges JENKINS-15355 JENKINS-21618 JENKINS-22941 JENKINS-25938 JENKINS-26391 JENKINS-26900 JENKINS-27476 JENKINS-27563 JENKINS-27564 JENKINS-27565 JENKINS-27566
            Fixing link text for JENKINS-6167
            jglick Jesse Glick made changes -
            Link This issue depends on JENKINS-27700 [ JENKINS-27700 ]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            changelog.html
            http://jenkins-ci.org/commit/jenkins/3e88ea26c3c6427651d377f5fca1c5390ddc92a4
            Log:
            JENKINS-27700 Noting that JENKINS-27565 changed settings format.

            oleg_nenashev Oleg Nenashev made changes -
            Link This issue is related to JENKINS-27708 [ JENKINS-27708 ]
            jyrmyx Jyri Ilama added a comment - - edited

            Is there any chance that this fix prevents builds from finishing while, for example, the Azure slave clean task thread is running? That thread can sometimes take quite a long time.

            I just updated from 1.606 to 1.612 and found this issue. All builds freeze at the end, before triggering a downstream job. The funny part is that the builds no longer show as running in the project view, but when you open the console output you find the build hanging. On the Jenkins main page, however, you can see them still running. When the clean task thread finishes, the builds finish too.

            phuang Peter Huang added a comment -

            I have come across the same problem.

            stephenconnolly Stephen Connolly added a comment -

            I have taken a look at the Azure cloud implementation... why did I do that? I knew what I was going to find before I looked... oh yes yet another completely borked cloud implementation which does everything hacky wrong ways because nobody actually knows the correct way to implement a cloud provider in Jenkins.

            I would not be surprised if the AzureSlaveCleanTaskThread causes issues, it seems to be doing lots of things it shouldn't... I really need to get the time to implement a clean cloud provisioning API so that people will stop using the current completely broken one (and just so that people who have tried to implement the Cloud API feel better, I have not seen any cloud implementation that is 100% correct... largely because the existing API is so woefully underspecified)

            szubster Tomasz Szuba made changes -
            Link This issue is related to JENKINS-28690 [ JENKINS-28690 ]
            jglick Jesse Glick made changes -
            Link This issue is related to JENKINS-20967 [ JENKINS-20967 ]
            jglick Jesse Glick made changes -
            Link This issue depends on JENKINS-32517 [ JENKINS-32517 ]
            stephenconnolly Stephen Connolly made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 161775 ] JNJira + In-Review [ 208569 ]
            jglick Jesse Glick made changes -
            Link This issue relates to JENKINS-56403 [ JENKINS-56403 ]
            runzexia runze xia added a comment - - edited

            Stephen Connolly, do you know which version includes this fix? I also encountered this problem using Jenkins core v.2176.2 with the Kubernetes plugin.


              People

              • Assignee:
                stephenconnolly Stephen Connolly
                Reporter:
                stephenconnolly Stephen Connolly