JENKINS-40771: Race condition in FlowExecutionList


      Description

      Hi,

      We found a potential bug that can only be replicated in pipeline jobs. Essentially, when a job is running and a Jenkins restart occurs, the job is left hanging indefinitely:

      Resuming build at Tue Jan 03 10:37:18 UTC 2017 after Jenkins restart
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      ...
      

      I noticed that other job types, e.g. freestyle, do not exhibit this behaviour.

      Here is a simple test pipeline script:

      node('XXXXX') {
      
        stage 'Stage 1'
          println 'Deploying to Stage 1...'
      
        stage 'Stage 2'
          println 'Running Tests in Stage 2'
          sleep 120
          println 'Tests passed!'
      
        stage 'Stage 3'
          println 'Deploying to Stage 3...'
      
      }
      

      To replicate the behaviour, restart Jenkins as soon as the build enters Stage 2.

      Currently I am using version 2.3, but I believe this issue also occurred in previous versions.

      Could you please help explain why this behaviour only occurs in pipeline jobs?

      Kind Regards,
      Tuan

            Activity

            jglick Jesse Glick added a comment -

            Tuan Nguyen I ran 2.32.2 on a fresh home dir (Linux / Java 8), installed Pipeline incl. workflow-basic-steps 2.3, workflow-cps 2.26, pipeline-stage-step 2.2, workflow-job 2.9, created a Pipeline job with the script

            node {
            
              stage 'Stage 1'
                println 'Deploying to Stage 1...'
            
              stage 'Stage 2'
                println 'Running Tests in Stage 2'
                sleep 120
                println 'Tests passed!'
            
              stage 'Stage 3'
                println 'Deploying to Stage 3...'
            
            }
            

            ran *Build Now*, waited for the stage view to show the second stage in progress, used /restart to restart Jenkins, and after the restart it completed as expected:

            Started by user admin
            [Pipeline] node
            Running on master in …/workspace/derng
            [Pipeline] {
            [Pipeline] stage (Stage 1)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 1
            Proceeding
            [Pipeline] echo
            Deploying to Stage 1...
            [Pipeline] stage (Stage 2)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 2
            Proceeding
            [Pipeline] echo
            Running Tests in Stage 2
            [Pipeline] sleep
            Sleeping for 2 min 0 sec
            Resuming build at Fri Feb 10 12:57:13 EST 2017 after Jenkins restart
            Ready to run at Fri Feb 10 12:57:14 EST 2017
            Sleeping for 1 min 34 sec
            [Pipeline] echo
            Tests passed!
            [Pipeline] stage (Stage 3)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 3
            Proceeding
            [Pipeline] echo
            Deploying to Stage 3...
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] End of Pipeline
            Finished: SUCCESS
            
            jglick Jesse Glick added a comment -

            Matthew Hall on the same configuration (incl. pipeline-build-step 2.4), created job matthall-main:

            Map parallel_jobs = ['branch_1': {build job: 'matthall-50'},
                                 'branch_2': {build job: 'matthall-40'}]
            parallel parallel_jobs
            

            and matthall-50:

            node { sleep(50) }
            

            and matthall-40:

            node { sleep(40) }
            

            Clicked Build Now on matthall-main; went to dashboard to see that matthall-50 and matthall-40 were both building on heavyweight executors; /restart. matthall-40 resumed:

            Started by upstream project "matthall-main" build number 1
            originally caused by:
             Started by user admin
            [Pipeline] node
            Running on master in …/workspace/matthall-40
            [Pipeline] {
            [Pipeline] sleep
            Sleeping for 40 sec
            Resuming build at Fri Feb 10 13:07:23 EST 2017 after Jenkins restart
            Ready to run at Fri Feb 10 13:07:24 EST 2017
            Sleeping for 7.1 sec
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] End of Pipeline
            Finished: SUCCESS
            

            matthall-50 did not:

            Started by upstream project "matthall-main" build number 1
            originally caused by:
             Started by user admin
            [Pipeline] node
            Running on master in …/workspace/matthall-50
            [Pipeline] {
            [Pipeline] sleep
            Sleeping for 50 sec
            Resuming build at Fri Feb 10 13:07:26 EST 2017 after Jenkins restart
            Waiting to resume Unknown Pipeline node step: ???
            Ready to run at Fri Feb 10 13:07:27 EST 2017
            

            with thread dump

            Thread #2
            	at DSL.sleep(should have stopped sleeping 1 min 52 sec)
            	at WorkflowScript.run(WorkflowScript:1)
            	at DSL.node(running on )
            	at WorkflowScript.run(WorkflowScript:1)
            

            I updated to a release candidate of workflow-basic-steps 2.4 and tried again, but still it fails (this time in matthall-40 #2). Looking into why…

            jglick Jesse Glick added a comment -

            Well, I diagnosed that one anyway: when two Pipeline builds are started at essentially the same moment, the registry of running builds can lose one of them, apparently due to a flaw in FlowExecutionList.saveLater, so the lost build does not resume after Jenkins restart.
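
To make the failure mode concrete, here is a minimal, deterministic sketch of how a registry can lose an entry when each registration reloads the persisted list while an earlier save is still pending. It is not the plugin's code: the class LostRegistrationSketch, its fields, and the queue that stands in for asynchronous saves are all hypothetical, and the mechanism is assumed from the description in this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of a build registry; NOT the real FlowExecutionList.
class LostRegistrationSketch {
    // Stand-in for the list persisted on disk.
    private List<String> onDisk = new ArrayList<>();
    // Stand-in for asynchronous saves that have been scheduled but not yet run.
    private final Queue<Runnable> pendingSaves = new ArrayDeque<>();

    // Assumed flaw: every registration starts from the on-disk state,
    // which is stale while an earlier save is still pending.
    void register(String build) {
        List<String> list = new ArrayList<>(onDisk); // reload from "disk"
        list.add(build);
        pendingSaves.add(() -> onDisk = list);       // schedule the save for later
    }

    // Runs the scheduled saves in the order they were queued.
    void runPendingSaves() {
        Runnable save;
        while ((save = pendingSaves.poll()) != null) {
            save.run();
        }
    }

    List<String> persisted() {
        return onDisk;
    }

    // Two builds start at essentially the same moment, before any save has run.
    static List<String> demo() {
        LostRegistrationSketch registry = new LostRegistrationSketch();
        registry.register("matthall-50");
        registry.register("matthall-40");
        registry.runPendingSaves();
        return registry.persisted();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [matthall-40]
    }
}
```

In this model the second registration never sees the first one, so the last save to run overwrites the list with a single entry. After a restart only that entry would be known to the registry.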

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            pom.xml
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
            http://jenkins-ci.org/commit/workflow-api-plugin/643dc718a18858e7c65f227daa18c998820dafc6
            Log:
            [FIXED JENKINS-40771] FlowExecutionList.register (and .unregister) was incorrectly loading from disk, causing a race condition with asynchronous saves.
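
One way to avoid this kind of race, sketched here under assumptions rather than taken from the commit, is to treat the in-memory list as authoritative, guard register and unregister with a lock, and hand an immutable snapshot to a single background saver. The class InMemoryRegistrySketch and all of its members are hypothetical. The actual change is in the commit linked above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a registry that never reloads on register; NOT the plugin's code.
class InMemoryRegistrySketch {
    // Authoritative state, only touched while holding this object's lock.
    private final List<String> running = new ArrayList<>();
    // Stand-in for the file on disk.
    private volatile List<String> persisted = new ArrayList<>();
    // A single saver thread applies snapshots in submission order.
    private final ExecutorService saver = Executors.newSingleThreadExecutor();

    synchronized void register(String build) {
        running.add(build);
        saveLater();
    }

    synchronized void unregister(String build) {
        running.remove(build);
        saveLater();
    }

    // Called with the lock held, so snapshots are queued in mutation order.
    private void saveLater() {
        List<String> snapshot = new ArrayList<>(running);
        saver.execute(() -> persisted = snapshot);
    }

    // Waits for queued saves, then returns a sorted copy of the persisted list.
    List<String> shutdownAndReadPersisted() throws InterruptedException {
        saver.shutdown();
        saver.awaitTermination(10, TimeUnit.SECONDS);
        List<String> result = new ArrayList<>(persisted);
        Collections.sort(result);
        return result;
    }

    // Registers two builds from two threads at roughly the same time.
    static List<String> demo() {
        InMemoryRegistrySketch registry = new InMemoryRegistrySketch();
        Thread first = new Thread(() -> registry.register("matthall-50"));
        Thread second = new Thread(() -> registry.register("matthall-40"));
        first.start();
        second.start();
        try {
            first.join();
            second.join();
            return registry.shutdownAndReadPersisted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because each snapshot is taken under the lock and the saver is single-threaded, the last save to run always reflects the latest in-memory state, whichever thread registered first.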

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
            http://jenkins-ci.org/commit/workflow-api-plugin/d7575cae43019af2e3f80bfc7248688ff6393a46
            Log:
            Merge pull request #31 from jglick/FlowExecutionList-JENKINS-40771

            JENKINS-40771 FlowExecutionList race condition


              People

              • Assignee: jglick Jesse Glick
                Reporter: derng Tuan Nguyen
              • Votes: 3
                Watchers: 7
