Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-38383

StackOverflow when loading a large pipeline job

    Details

    • Similar Issues:

      Description

      I've been using infinite Pipeline for a demo. The job was running for a couple of hours, and then Jenkins master declined to the build due to the StackOverflow.

      node {
          int i = 0;
          while(true) {
              sh "echo 'Hello, world ${i}!'"
              sh "sleep 5"
              i = i + 1;
          }
      }
      

      Log: See attachment

        Attachments

        1. InfinitePipeline.zip
          1.45 MB
        2. log.txt
          7.50 MB
        3. workflow-job.hpi
          109 kB

          Issue Links

            Activity

            Hide
            svanoort Sam Van Oort added a comment -

            Problem is here: we're doing a recursive call to get the logs - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L478

            Needs to be rewritten as a while(...) loop with the current search candidate inside the loop. However I understand that Jesse Glick is doing some significant rewrites of log storage at the moment, so I'll hold off touching the code to avoid collisions.

            Side note, this is explicitly a "don't do that" in my recent Jenkins World talk :-P

            Show
            svanoort Sam Van Oort added a comment - Problem is here: we're doing a recursive call to get the logs - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L478 Needs to be rewritten as a while(...) loop with the current search candidate inside the loop. However I understand that Jesse Glick is doing some significant rewrites of log storage at the moment, so I'll hold off touching the code to avoid collisions. Side note, this is explicitly a "don't do that" in my recent Jenkins World talk :-P
            Hide
            svanoort Sam Van Oort added a comment -

            Downgrading priority because it should only happen if you have to search many nodes to find one with a log.

            That said, there's some weirdness with the logic here.

            Show
            svanoort Sam Van Oort added a comment - Downgrading priority because it should only happen if you have to search many nodes to find one with a log. That said, there's some weirdness with the logic here.
            Hide
            thxmasj Thomas Johansen added a comment - - edited

            I get a similar symptom on startup with Jenkins version 2.102. The stacktrace is now:

            [...] com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
            2018-02-22T11:17:37.589287583Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2232)
            2018-02-22T11:17:37.589290948Z  at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
            2018-02-22T11:17:37.589293750Z  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
            2018-02-22T11:17:37.589296421Z  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
            2018-02-22T11:17:37.589299106Z  at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
            2018-02-22T11:17:37.589301839Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.getLogPrefix(WorkflowRun.java:556)
            2018-02-22T11:17:37.589304561Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$900(WorkflowRun.java:134)
            2018-02-22T11:17:37.589308088Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:547)
            2018-02-22T11:17:37.589310809Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:537)
            2018-02-22T11:17:37.589313487Z  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
            2018-02-22T11:17:37.589316198Z  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
            2018-02-22T11:17:37.589318881Z  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
            2018-02-22T11:17:37.589323997Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
            2018-02-22T11:17:37.589326739Z  ... 222 more
            2018-02-22T11:17:37.589333918Z Caused by: com.google.common.util.concurrent.ExecutionError [...]

            Jenkins is then irresponsive for some minutes, and also it seems like the jobs resumed while in the loop is sometimes stuck after restart.

            Is this because I'm having pipeline-jobs using while-loops?

            Show
            thxmasj Thomas Johansen added a comment - - edited I get a similar symptom on startup with Jenkins version 2.102. The stacktrace is now: [...] com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError 2018-02-22T11:17:37.589287583Z at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2232) 2018-02-22T11:17:37.589290948Z at com.google.common.cache.LocalCache.get(LocalCache.java:3965) 2018-02-22T11:17:37.589293750Z at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) 2018-02-22T11:17:37.589296421Z at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) 2018-02-22T11:17:37.589299106Z at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834) 2018-02-22T11:17:37.589301839Z at org.jenkinsci.plugins.workflow.job.WorkflowRun.getLogPrefix(WorkflowRun.java:556) 2018-02-22T11:17:37.589304561Z at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$900(WorkflowRun.java:134) 2018-02-22T11:17:37.589308088Z at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:547) 2018-02-22T11:17:37.589310809Z at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:537) 2018-02-22T11:17:37.589313487Z at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) 2018-02-22T11:17:37.589316198Z at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) 2018-02-22T11:17:37.589318881Z at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) 2018-02-22T11:17:37.589323997Z at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) 2018-02-22T11:17:37.589326739Z ... 222 more 2018-02-22T11:17:37.589333918Z Caused by: com.google.common.util.concurrent.ExecutionError [...] Jenkins is then irresponsive for some minutes, and also it seems like the jobs resumed while in the loop is sometimes stuck after restart. Is this because I'm having pipeline-jobs using while-loops?
            Hide
            svanoort Sam Van Oort added a comment -

            Would be resolved by JENKINS-38381 (long handling rewrite).  This is the third bug I've seen tied to the logPrefix cache and similar logic.   

            This officially means it's time for at least a fast-patch solution to tide us over until the comprehensive solution lands with the rewrite (my next Big Work Item).

            Thomas Johansen the issue is that your pipeline is running a very large number of steps (several thousand generally) and under the wrong circumstances this particular operation converts to rather nasty recursive call.  

            I'm going to take this on for fix in the next week or so. 

            Show
            svanoort Sam Van Oort added a comment - Would be resolved by JENKINS-38381 (long handling rewrite).  This is the third bug I've seen tied to the logPrefix cache and similar logic.    This officially means it's time for at least a fast-patch solution to tide us over until the comprehensive solution lands with the rewrite (my next Big Work Item). Thomas Johansen the issue is that your pipeline is running a very large number of steps (several thousand generally) and under the wrong circumstances this particular operation converts to rather nasty recursive call.   I'm going to take this on for fix in the next week or so. 
            Hide
            thxmasj Thomas Johansen added a comment -

            Sam Van Oort: If I understand you correctly JENKINS-38381 will resolve my issue, and you aim to fix it in the close future? I am still considering to rewrite my pipelines to avoid while-loops (which result in thousands of steps), as it seems like the Jenkins-code is not designed for this. Btw I'm using loops for polling Jira and Git. My rewrite would then be to move the loop inside a shell-step.

            Show
            thxmasj Thomas Johansen added a comment - Sam Van Oort : If I understand you correctly JENKINS-38381 will resolve my issue, and you aim to fix it in the close future? I am still considering to rewrite my pipelines to avoid while-loops (which result in thousands of steps), as it seems like the Jenkins-code is not designed for this. Btw I'm using loops for polling Jira and Git. My rewrite would then be to move the loop inside a shell-step.
            Hide
            svanoort Sam Van Oort added a comment -

            Thomas Johansen I'd suggest definitely using a sh loop or timeout/retry pipeline loops instead (with less loop cycles involved there).  There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general if your Pipeline is over 1000 FlowNodes (500-700+ steps) then it's a sign one should reconsider the approach used.

            I aim to fix this with JENKINS-38381 but that isn't trivial, so I'm not comfortable giving an explicit date date yet and will look to make tactical fixes initially.

            Show
            svanoort Sam Van Oort added a comment - Thomas Johansen I'd suggest definitely using a sh loop or timeout/retry pipeline loops instead (with less loop cycles involved there).  There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general if your Pipeline is over 1000 FlowNodes (500-700+ steps) then it's a sign one should reconsider the approach used. I aim to fix this with JENKINS-38381 but that isn't trivial, so I'm not comfortable giving an explicit date date yet and will look to make tactical fixes initially.
            Hide
            thxmasj Thomas Johansen added a comment -

            Sam Van Oort: I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML-files under the workflow-directory for the build).

            Unfortunately I found that my strategy with sh-loops is a dead-end, as the sh-step requires an agent (with node, workspace and executor). As we might have tens or even hundreds of concurrent builds in a Jira-polling stage for several days those stages should not use agents.

            I had a look at the Cheetah-documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set on a per stage level – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow-problem for while/retry loops. I will test the option set on pipeline level now (although we need durability for some of the stages).

            Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page. Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?

            Show
            thxmasj Thomas Johansen added a comment - Sam Van Oort : I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML-files under the workflow-directory for the build). Unfortunately I found that my strategy with sh-loops is a dead-end, as the sh-step requires an agent (with node, workspace and executor). As we might have tens or even hundreds of concurrent builds in a Jira-polling stage for several days those stages should not use agents. I had a look at the Cheetah-documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set on a per stage level – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow-problem for while/retry loops. I will test the option set on pipeline level now (although we need durability for some of the stages). Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page . Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?
            Hide
            thxmasj Thomas Johansen added a comment -

            Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather a NPE and a failing build instead:

            [...]
            18:00:36 Iteration 6188
            [Pipeline] echo
            Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart
            [Pipeline] End of Pipeline
            java.lang.NullPointerException
            	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43)
            	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
            	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
            	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            	at java.lang.Thread.run(Thread.java:748)
            Finished: FAILURE
            
            Show
            thxmasj Thomas Johansen added a comment - Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather a NPE and a failing build instead: [...] 18:00:36 Iteration 6188 [Pipeline] echo Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart [Pipeline] End of Pipeline java.lang.NullPointerException at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43) at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Finished: FAILURE
            Hide
            svanoort Sam Van Oort added a comment -

            Solution provided by removing explicit recursion and instead using an iteration-based approach + changes to caching.

            Show
            svanoort Sam Van Oort added a comment - Solution provided by removing explicit recursion and instead using an iteration-based approach + changes to caching.
            Hide
            svanoort Sam Van Oort added a comment -

            Thomas Johansen I'm not sure exactly what's triggering an NPE there (looks like it somehow got caught in an invalid state) – tried to reproduce but I can't.

            But I've got a fix that's verified to remove the StackOverflowException and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built off of it which should remove your issue.

            In general I'd advise to find alternate courses in Pipeline than explicit polling loops – in this case an enhancement to JIRA plugin or simply accept the use of an agent to do polling while needed (perhaps with exponential backoff). workflow-job.hpi

            Show
            svanoort Sam Van Oort added a comment - Thomas Johansen I'm not sure exactly what's triggering an NPE there (looks like it somehow got caught in an invalid state) – tried to reproduce but I can't. But I've got a fix that's verified to remove the StackOverflowException and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built off of it which should remove your issue. In general I'd advise to find alternate courses in Pipeline than explicit polling loops – in this case an enhancement to JIRA plugin or simply accept the use of an agent to do polling while needed (perhaps with exponential backoff). workflow-job.hpi
            Hide
            thxmasj Thomas Johansen added a comment -

            Sam Van Oort: I tried the snapshot. No StackOverflowException in the log this time, but the build did not resume either (waited almost half an hour). I'm considering to use a polling agent in combination with the input construct to completely avoid these damned loops.

            [Pipeline] echo
            19:41:58 Iteration 6143
            [Pipeline] echo
            Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart
            Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
            19:41:58 Iteration 6143
            Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
            Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
            Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2
            jenkins-slave-1 doesn’t have label jenkins-slave-2
            jenkins-slave-2 is offline
            Ready to run at Tue Feb 27 19:43:10 GMT 2018
            Show
            thxmasj Thomas Johansen added a comment - Sam Van Oort : I tried the snapshot. No StackOverflowException in the log this time, but the build did not resume either (waited almost half an hour). I'm considering to use a polling agent in combination with the input construct to completely avoid these damned loops. [Pipeline] echo 19:41:58 Iteration 6143 [Pipeline] echo Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline 19:41:58 Iteration 6143 Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2 jenkins-slave-1 doesn’t have label jenkins-slave-2 jenkins-slave-2 is offline Ready to run at Tue Feb 27 19:43:10 GMT 2018
            Hide
            svanoort Sam Van Oort added a comment -

            Thomas Johansen That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds with specific combinations of build agents used (noted in a couple quirks by https://issues.jenkins-ci.org/browse/JENKINS-47173) – do you have a sample of your pipeline here to reproduce with?

            Show
            svanoort Sam Van Oort added a comment - Thomas Johansen That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds with specific combinations of build agents used (noted in a couple quirks by https://issues.jenkins-ci.org/browse/JENKINS-47173 ) – do you have a sample of your pipeline here to reproduce with?
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Sam Van Oort
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
            http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63
            Log:
            Merge pull request #91 from svanoort/fix-prefix-bugs

            JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes

            Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Sam Van Oort Path: src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63 Log: Merge pull request #91 from svanoort/fix-prefix-bugs JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7
            Hide
            svanoort Sam Van Oort added a comment -

            Released with v2.19

            Show
            svanoort Sam Van Oort added a comment - Released with v2.19

              People

              • Assignee:
                svanoort Sam Van Oort
                Reporter:
                oleg_nenashev Oleg Nenashev
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: