  Jenkins / JENKINS-59073

Random InterruptedException in pipeline builds


      Description

      Within pipeline builds, shell steps randomly fail with an unspecific java.lang.InterruptedException; the full stack trace is listed below.

      Unfortunately, this happens often enough to be a major issue in our development process: failing build results cannot be trusted, and multi-hour builds may have to be retriggered several times.

      Since we cannot reliably trigger the issue, I cannot provide a minimal example for reproduction. This is especially painful since all debugging has to happen in production.

      Background information:

      • Our slaves are started dynamically using the swarm plugin
      • The orchestration of these slaves is handled by a shared library; the respective step is [available on github]
      • We've only seen the exception occur on shell steps, other steps do not seem to throw (although not many were tested)
      • Only the first shell step might throw, if it succeeds the others will be fine
      • The master can catch the exception and continue with error handling
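
      As the last bullet notes, the master can catch the exception. A minimal scripted-Pipeline sketch of such error handling (the node label, shell command, and retry count are illustrative, not taken from this report):

      ```groovy
      // Hypothetical error handling around the failing shell step: catch the
      // interruption on the master side, log it, and let retry() re-run the
      // block. Beware that broadly catching InterruptedException can also
      // swallow deliberate build aborts, so rethrow after logging.
      node('swarm') {
          retry(3) {
              try {
                  sh './run-tests.sh'
              } catch (InterruptedException e) {
                  echo "shell step interrupted: ${e}"
                  throw e // rethrow so retry() attempts the block again
              }
          }
      }
      ```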

      Complete stack trace:

      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:177)
      	at hudson.remoting.Channel.call(Channel.java:956)
      	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1060)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      Caused: java.io.InterruptedIOException
      	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1062)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      	at jesh.call(jesh.groovy:28)
      	at withModules.call(withModules.groovy:45)
      	at WorkflowScript.run(WorkflowScript:57)
      	at onSlurmResource.call(onSlurmResource.groovy:46)
      	at runOnSlave.call(runOnSlave.groovy:37)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:84)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
      	at sun.reflect.GeneratedMethodAccessor395.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.LocalVariableBlock$LocalVariable.get(LocalVariableBlock.java:39)
      	at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
      	at com.cloudbees.groovy.cps.impl.LocalVariableBlock.evalLValue(LocalVariableBlock.java:28)
      	at com.cloudbees.groovy.cps.LValueBlock$BlockImpl.eval(LValueBlock.java:55)
      	at com.cloudbees.groovy.cps.LValueBlock.eval(LValueBlock.java:16)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:186)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:370)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:93)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:282)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:270)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:66)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      


          Activity

          Basil Crow added a comment -

          I happened to come across this bug while triaging the Swarm component. It is so far unclear what the cause of your problem is: it might be related to Swarm, or it might be related to durable-task or Remoting. It would help to know which versions of the Swarm plugin, the Swarm client, durable-task, and workflow-durable-task you are running; from these we could tell which version of Remoting you are using on each side of the connection. If these plugins aren't already up to date, try updating them first.

          The proximate cause of your problem given the above stack trace is hudson.remoting.Request.call(Request.java:177):

              while(response==null && !channel.isInClosed())
                  // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
                  // but in production I've observed that in rare occasion it can block forever, even after a channel
                  // is gone. So be defensive against that.
                  wait(30*1000); // <--- cause of interruption

          Here the Jenkins master is waiting, in 30-second slices, for a response from the agent over Remoting. The wait was interrupted before a response arrived, which raises the InterruptedException that fails the job. You should look into the other side of the connection (the agent side) to see why it stopped responding to the master. Try turning the logging up as high as possible on the Swarm client side and check whether anything suspicious appears there.
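
          To make the failure mode concrete, here is a small self-contained Java sketch (not the actual Remoting code; the class and method names are invented) of the same wait pattern: a thread blocks in timed slices until a response arrives or the channel closes, and an interrupt from another thread surfaces as InterruptedException:

          ```java
          // Sketch of the wait loop in hudson.remoting.Request.call: the caller
          // sleeps in 30-second slices until a response arrives or the channel
          // closes. The InterruptedException in the build log means some other
          // thread interrupted this wait before a response came back.
          public class WaitLoopDemo {
              private final Object lock = new Object();
              private volatile Object response = null;
              private volatile boolean channelClosed = false;

              /** Blocks like Request.call; returns how the wait ended. */
              public String awaitResponse(long sliceMillis) {
                  synchronized (lock) {
                      try {
                          while (response == null && !channelClosed) {
                              lock.wait(sliceMillis); // re-check periodically, as Request.call does
                          }
                          return "completed";
                      } catch (InterruptedException e) {
                          return "interrupted"; // the path seen in this bug report
                      }
                  }
              }

              public static void main(String[] args) throws Exception {
                  WaitLoopDemo demo = new WaitLoopDemo();
                  Thread waiter = new Thread(() -> System.out.println(demo.awaitResponse(30_000)));
                  waiter.start();
                  Thread.sleep(200);   // give the waiter time to enter wait()
                  waiter.interrupt();  // simulate the interruption from the channel side
                  waiter.join();       // prints "interrupted"
              }
          }
          ```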

          Basil Crow added a comment -

          The fact that the Jenkins master could not reach the Swarm client for 30 seconds points to either a networking issue at the deployment site or resource saturation (and therefore network unresponsiveness) on the Swarm client side. To debug this, one would need to analyze the logs on both the Jenkins master side as well as the Swarm client side, after turning up logging as described in the documentation. One should also monitor CPU, memory, and disk utilization (broadly speaking, "system load") on the Swarm client side during the problematic times. Note that as of the fixes for JENKINS-41854 and JENKINS-50504, Jenkins agents should be resilient to temporary disconnects. These fixes are present in:

          • Jenkins core 2.176 LTS or later
          • workflow-basic-steps 2.17 or later
          • workflow-cps 2.70 or later
          • workflow-durable-task-step 2.31 or later
          • workflow-step-api 2.20 or later
          • workflow-support-plugin 3.3 or later

          I am closing this bug as "Cannot Reproduce". I suggest following the debugging instructions given above and contacting the Jenkins users list if you need any additional assistance. If you have cause to suspect a bug in Swarm itself, please open a new issue with detailed steps to reproduce (modifying PipelineJobRestartTest in the plugin test suite would be a good place to start).
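
          One concrete way to turn up logging on the client side, assuming the Swarm client is launched as a plain JVM process, is to pass a standard java.util.logging configuration file. The logger levels below are assumptions; hudson.remoting is the package visible in the stack trace above.

          ```properties
          # hypothetical logging.properties for the Swarm client JVM
          handlers = java.util.logging.ConsoleHandler
          java.util.logging.ConsoleHandler.level = ALL
          .level = INFO
          hudson.remoting.level = FINEST
          ```

          The client would then be started with something like java -Djava.util.logging.config.file=logging.properties -jar swarm-client.jar ... (remaining Swarm client options unchanged).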


            People

            • Assignee: Unassigned
            • Reporter: Chad Williams (chaddo)
            • Votes: 0
            • Watchers: 3
