Jenkins / JENKINS-25727

Occasional exit status -1 with long latencies

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: durable-task-plugin
    • Environment:
      Jenkins ver. 1.580.1.1-beta-6 (Jenkins Enterprise by CloudBees 14.11)
      Description

      org.jenkinsci.plugins.workflow.cps.steps.ParallelStepException: Parallel step long running test task failed
      	at org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$ResultHandler$Callback.checkAllDone(ParallelStep.java:126)
      	at org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$ResultHandler$Callback.onFailure(ParallelStep.java:105)
      	at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$FailureAdapter.receive(CpsBodyExecution.java:295)
      	at com.cloudbees.groovy.cps.impl.ThrowBlock$1.receive(ThrowBlock.java:68)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:58)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:145)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:164)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:262)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$000(CpsThreadGroup.java:70)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:174)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:172)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:47)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      Caused by: hudson.AbortException: script returned exit code -1
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:205)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:159)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
      	... 3 more
      Finished: FAILURE
      

            Activity

            kohsuke Kohsuke Kawaguchi added a comment -

            The fake exit code "-1" signifies that the process has disappeared without leaving the exit code file behind.

            Looking into why this is the case.

            kohsuke Kohsuke Kawaguchi added a comment -

            valentina armenise confirmed that there is a large latency between the master and a slave, and that the issue happens about one time in ten. So the race condition hypothesis looks more plausible.

            jglick Jesse Glick added a comment -

            88aed02 was apparently not enough. We have a PID for the wrapper script, and jenkins-result.txt has not been created, yet the wrapper script does not seem to be running any more. Unclear what leads to this situation.

            jglick Jesse Glick added a comment -

            Kohsuke Kawaguchi suggests that ShellController.exitStatus hits a race condition: it calls exitStatus while the process is running, which returns null, then the process finishes, then isAlive is called and says it is not running.

            The probable fix is to recheck exitStatus before returning -1.
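
            The race described above can be sketched as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the class name ExitStatusRace, the two method names, and SimulatedProcess are all hypothetical, and the result file and the liveness probe are stood in for by plain callbacks.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class ExitStatusRace {

    /** Hypothetical naive check: read the result file once, then probe liveness. */
    static Integer naiveExitStatus(Supplier<Integer> readResultFile, BooleanSupplier isAlive) {
        Integer status = readResultFile.get();
        if (status != null) {
            return status;      // result file already written
        }
        if (isAlive.getAsBoolean()) {
            return null;        // still running, ask again later
        }
        return -1;              // looks dead with no result: the fake exit code
    }

    /** Hypothetical fixed check: re-read the result file before giving up with -1. */
    static Integer recheckedExitStatus(Supplier<Integer> readResultFile, BooleanSupplier isAlive) {
        Integer status = readResultFile.get();
        if (status != null) {
            return status;
        }
        if (isAlive.getAsBoolean()) {
            return null;
        }
        // The process may have finished between the first read and the liveness probe.
        status = readResultFile.get();
        return status != null ? status : -1;
    }

    /** Simulates a process that exits with 0 right after the first read of the result file. */
    static final class SimulatedProcess {
        private int reads = 0;

        Integer readResultFile() {
            reads++;
            return reads == 1 ? null : 0;   // first read: no file yet; later reads: exit code 0
        }

        boolean isAlive() {
            return false;                   // by the time we probe, the process is gone
        }
    }

    public static void main(String[] args) {
        SimulatedProcess a = new SimulatedProcess();
        System.out.println("naive: " + naiveExitStatus(a::readResultFile, a::isAlive));
        SimulatedProcess b = new SimulatedProcess();
        System.out.println("rechecked: " + recheckedExitStatus(b::readResultFile, b::isAlive));
    }
}
```

            In this simulation the naive check reports -1 even though the process exited normally, while the rechecked variant picks up the real exit code. A process that truly vanished without writing a result still yields -1 in both variants.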

            jglick Jesse Glick added a comment -

            In fact he already attempted that fix in https://github.com/jenkinsci/durable-task-plugin/commit/10a3ebdc1e4825fd334cfe58ecf294c9384d5f06 though this is not complete.

            jglick Jesse Glick added a comment -

            Released attempted fix in Durable Task plugin 1.0.

            varmenise valentina armenise added a comment -

            Tested with plugin version 1.0. It worked.

            sumdumgai A C added a comment -

            Reopening. A similar hang is occasionally occurring again with Jenkins 1.6.13 - 1.6.15 and Workflow 1.6 during bat steps on Windows. The process that Workflow is waiting for has ended successfully according to the log output, and no stray generated batch files or directories are left behind, so it still seems that a race condition exists.

            Why is this concatenating to a jenkins-result.txt file anyway? That doesn't seem very robust. Can't we just re-pipe standard output and standard error directly?

            jglick Jesse Glick added a comment -

            A C, I am not sure what bug you are seeing, but it sounds different from this one. Better to file it separately. Note that 1.7 included a fix in a related area.

            sumdumgai A C added a comment -

            OK. WF 1.7 seems to have made this problem worse. I am tracking a possibly related symptom in JENKINS-28604.


              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                varmenise valentina armenise