  Jenkins / JENKINS-37730

DurableTaskStep.Execution hanging after process is dead


      Description

      Found a case where a sh step ceased to produce more output in the middle of a command, for no apparent reason, and the build did not respond to normal abort. The virtual thread dump said

      Thread #80
      	at DSL.sh(completed process (code -1) in /...@tmp/durable-... on ... (pid: ...))
      	at ...
      

      But there is no active CPS VM thread, and nothing visibly happening on the agent, and all Timer threads are idle. So it seems that a call to check would have caused the step to fail—but perhaps none came?

      Possibly stop should do its own check for a non-null Controller.exitStatus and immediately fail in such a case (but we run the risk of delivering doubled-up events if check does run later); or synchronously call check (though this runs the risk of having two such calls run simultaneously—it is not thread safe); or somehow reschedule it (same problem).
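
      A rough sketch in Java of the first option (purely illustrative; the controller, ws, and launcher names are stand-ins for however the real Execution reaches its Controller, workspace, and Launcher):

          @Override
          public void stop(Throwable cause) throws Exception {
              // Placeholder lookups; the actual Execution obtains these differently.
              Integer status = controller.exitStatus(ws, launcher); // null while the process is still running
              if (status != null) {
                  // The wrapper process already exited but check() never delivered the result,
                  // so complete the step immediately rather than waiting for a poll that may never come.
                  getContext().onFailure(cause);
                  return;
              }
              // Otherwise fall through to the normal termination handling.
          }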

      At a minimum, the virtual thread dump should indicate what the current recurrencePeriod is. And the calls to schedule could save their ScheduledFuture results in a transient field, so we can check cancelled and done flags. Such diagnostics might make it clearer next time what actually happened.
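
      For example (again only a sketch of the diagnostic idea, not the actual change; it assumes a no-argument check() method and the existing getStatus() hook used by the virtual thread dump):

          import java.util.concurrent.ScheduledFuture;
          import java.util.concurrent.TimeUnit;

          // Inside DurableTaskStep.Execution (sketch):
          private transient volatile ScheduledFuture<?> nextCheck; // handle to the pending poll, if any
          private long recurrencePeriod; // milliseconds between polls

          private void scheduleNextCheck() {
              // Keep the ScheduledFuture so diagnostics can tell whether a poll is really pending.
              nextCheck = jenkins.util.Timer.get().schedule(this::check, recurrencePeriod, TimeUnit.MILLISECONDS);
          }

          @Override public String getStatus() {
              ScheduledFuture<?> f = nextCheck;
              return "recurrencePeriod=" + recurrencePeriod + "ms; "
                  + (f == null ? "no check scheduled"
                               : "check done=" + f.isDone() + ", cancelled=" + f.isCancelled());
          }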

      Also, a term claimed to be terminating the sh step, but the build still did not finish. Again nothing in the physical thread dumps, and the virtual thread dump still claims to be inside sh. The system log showed

      ... WARNING org.jenkinsci.plugins.workflow.cps.CpsStepContext onFailure
      already completed CpsStepContext[186]:Owner[...]
      java.lang.IllegalStateException: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:325)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$5.onSuccess(WorkflowRun.java:300)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$5.onSuccess(WorkflowRun.java:296)
      	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
      	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
      	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
      	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:134)
      	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:170)
      	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:53)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:702)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:689)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:626)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:32)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.doTerm(WorkflowRun.java:295)
      	at ...
      

      So the program state seems to be somehow inconsistent as well; perhaps sh did complete (it is not shown as in progress in flowGraphTable).

      It seems that the virtual thread dump needs some kind of fix (TBD) to better report the real state of a problematic program.


            Activity

            Edward Easton added a comment - edited

            Hi there,
            Just letting you know this is hitting me in the wild, thanks for raising this! I get the exact same traceback above in the server logs.

            A little background - I've built a moderately complex CPSWorkflowLib for building in-house Python projects with a DSL. The DSL specifies named test stages with simple closures to specify the stage body, and a number of workers to spread the stage over.
            This has been working well until I updated to the latest pipeline plugin versions a few days ago; now whenever one of the test stages raises a non-zero exit code in a `sh` step it will hang the build.

            I'm trying to come up with a minimal test case that reproduces the problem; it's tricky, as there's a lot of setup code in the DSL library that might be affecting things. I'll post it when I get something workable.

            Versions: Jenkins 2.7.4, workflow-cps-global-lib 2.3
            I might also add this isn't 'minor'! I can no longer use Pipeline until I find a workaround for this :/

            Edward Easton added a comment

            Hi, I traced one of the hangs to the problem mentioned here: JENKINS-38566

            Ryan Campbell added a comment

            Marking this as critical since it results in an unkillable job on the agent, making the agent unusable. Is there a workaround?

            Jesse Glick added a comment

            The original issue did not result in an unkillable build, just a failure to respond to term (did respond to kill), so Ryan Campbell is possibly seeing something unrelated (and Edward Easton definitely was). This issue is not about a root cause, which remains unknown, but about a robustness aspect of the response to some other bug.

            Jesse Glick added a comment

            Possibly stop should do its own check for a non-null Controller.exitStatus and immediately fail in such a case

            Plan to handle that differently in JENKINS-38769, by just making sure stop always stops the step, regardless of process exit status.
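
            Roughly (an illustrative fragment only, with controller, ws, and launcher as placeholder names), that would mean something like:

                @Override
                public void stop(Throwable cause) throws Exception {
                    try {
                        controller.stop(ws, launcher); // best-effort: signal the wrapper process if it still exists
                    } finally {
                        getContext().onFailure(cause); // complete the step no matter what exitStatus says
                    }
                }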

            Jesse Glick added a comment

            Fixed at least the diagnostic aspects.

            Jonas Schneider added a comment

            We're seeing the same behavior with a roughly similar pipeline, running Docker image `jenkinsci/blueocean:1.0.1` (Jenkins ver. 2.46.1) for master and slave. Abridged pipeline script, thread dump and process dump at:

            https://gist.github.com/jonasschneider/ed81faffdd96d3e541cb6f487871029a

            After the `docker run` finishes, Jenkins for some reason does not reap the `docker` processes, as can be seen in the ps output. This is a blocker, since it hangs our ~10-minute builds for multiple hours on end. We've only seen it appear under some amount of load, that is, when multiple builds of the same job are running.

            Is there any way to better debug what's going on here?

            Jesse Glick added a comment

            Jonas Schneider your issue is probably unrelated. While I do not know offhand what would cause the build to “hang” (use the Thread Dump link to get some clues), certainly your script looks wrong since you are launching Docker containers which you never stop. Perhaps you forgot --rm?

            Jonas Schneider added a comment

            Sorry, I must've accidentally removed the thread dump from the gist. Here it is: https://gist.github.com/jonasschneider/ed81faffdd96d3e541cb6f487871029a#file-thread-dump-txt – to me this looks like the docker run command has exited, but Jenkins is not proceeding with the execution.

            Indeed I am using docker run without --rm. However, docker run will still exit as soon as the script inside exits. The container stops at that time. We do clean up the actual container filesystem later with a docker rm, which I omitted from the Jenkinsfile.

            Jesse Glick added a comment

            Jonas Schneider not sure offhand, something clearly went haywire there. Any interesting system log output? Without steps to reproduce I am not sure how to help.


              People

              • Assignee: Jesse Glick
              • Reporter: Jesse Glick
              • Votes: 1
              • Watchers: 6
