JENKINS-59073

Random InterruptedException in pipeline builds

      Description

      Within pipeline builds, shell steps randomly fail with an unspecific java.lang.InterruptedException; a full stack trace is listed below.

      Unfortunately, this happens often enough to be a major issue in our development process: negative build results cannot be trusted, and multi-hour builds may have to be retriggered multiple times.

      Since we cannot reliably trigger the issue, I cannot provide a minimal example for reproduction. This is especially painful since all debugging has to happen in production.

      Background information:

      • Our slaves are started dynamically using the Swarm plugin
      • The orchestration of these slaves is handled by a shared library; the respective step is [available on github]
      • We've only seen the exception occur on shell steps; other steps do not seem to throw (although not many were tested)
      • Only the first shell step seems to throw; if it succeeds, the subsequent ones are fine
      • The master can catch the exception and continue with error handling (see the sketch below)
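
      To illustrate the last point, here is a minimal scripted-pipeline sketch of what we mean by error handling; the node label, script name, and retry count are placeholders, not taken from our actual Jenkinsfile:

          // Minimal sketch only: wrap the first shell step so a spurious
          // interruption does not immediately fail a multi-hour build.
          node('swarm') {                      // placeholder label for a dynamically started slave
              retry(3) {                       // placeholder retry count
                  try {
                      sh './build.sh'          // the (first) shell step that occasionally throws
                  } catch (java.io.InterruptedIOException e) {
                      echo 'shell step was interrupted; rethrowing so retry() starts another attempt'
                      throw e
                  }
              }
          }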

      Complete stack trace:

      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:177)
      	at hudson.remoting.Channel.call(Channel.java:956)
      	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1060)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      Caused: java.io.InterruptedIOException
      	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1062)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      	at jesh.call(jesh.groovy:28)
      	at withModules.call(withModules.groovy:45)
      	at WorkflowScript.run(WorkflowScript:57)
      	at onSlurmResource.call(onSlurmResource.groovy:46)
      	at runOnSlave.call(runOnSlave.groovy:37)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:84)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
      	at sun.reflect.GeneratedMethodAccessor395.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.LocalVariableBlock$LocalVariable.get(LocalVariableBlock.java:39)
      	at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
      	at com.cloudbees.groovy.cps.impl.LocalVariableBlock.evalLValue(LocalVariableBlock.java:28)
      	at com.cloudbees.groovy.cps.LValueBlock$BlockImpl.eval(LValueBlock.java:55)
      	at com.cloudbees.groovy.cps.LValueBlock.eval(LValueBlock.java:16)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:186)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:370)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:93)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:282)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:270)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:66)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

          Activity

          Basil Crow added a comment -

          I happened to come across this bug while triaging the Swarm component. Unclear what the cause of your problem is so far. It might be related to Swarm, or it might be related to durable-task or Remoting. It would be helpful to know what versions of the Swarm plugin, Swarm client, durable-task, and workflow-durable-task you are running. From these we would be able to tell what version of Remoting you are using on each side of the connection. If these plugins aren't already up-to-date, try updating them first.
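
          If it helps with collecting those versions, they can be read off in one go from the master's Script Console; the snippet below is generic (the plugin short names are my assumption about which artifacts are installed), nothing specific to this issue:

              // Run in Manage Jenkins -> Script Console on the master.
              // Prints the installed version of the plugins mentioned above.
              def relevant = ['swarm', 'durable-task', 'workflow-durable-task-step']
              Jenkins.instance.pluginManager.plugins
                  .findAll { it.shortName in relevant }
                  .each { println "${it.shortName} ${it.version}" }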

          The proximate cause of your problem given the above stack trace is hudson.remoting.Request.call(Request.java:177):

                              while(response==null && !channel.isInClosed())
                                  // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
                                  // but in production I've observed that in rare occasion it can block forever, even after a channel
                                  // is gone. So be defensive against that.
                                  wait(30*1000); <--- cause of interruption
          

          Here the Jenkins master is timing out after waiting 30 seconds for some type of response from the agent over Remoting. It then throws an InterruptedException, which causes the job to fail. You should look into the other side of the connection (the agent side) to see why it stopped responding to the master. Try turning up the logging as high as possible on the Swarm client side and see if anything suspicious shows up there.
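
          For reference, one way to turn the logging all the way up is to point the Swarm client's JVM at a java.util.logging configuration file; the file and jar names and the levels below are only an example of the mechanism, not settings verified against this particular issue:

              # logging.properties (example): log Remoting at FINEST
              handlers = java.util.logging.ConsoleHandler
              java.util.logging.ConsoleHandler.level = ALL
              .level = INFO
              hudson.remoting.level = FINEST

              java -Djava.util.logging.config.file=logging.properties -jar swarm-client.jar <your usual options>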


            People

            • Assignee: Unassigned
            • Reporter: Chad Williams
            • Votes: 0
            • Watchers: 2
