JENKINS-46507

Parallel Pipeline random java.lang.InterruptedException

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Labels:
      None
    • Environment:
    • Similar Issues:
    • Released As:
      workflow-durable-task-step 2.29

      Description

      My Pipeline job sometimes fails at random with the java.lang.InterruptedException below:

      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
      	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
      	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
      	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:251)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:207)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.print(CpsScript.java:202)
      	at sun.reflect.GeneratedMethodAccessor103253.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      ....
      ....
      

      Please refer to the file attachment for the full console log and the pipeline Jenkinsfile code.

        Attachments

        1. consoleText_ERROR.txt
          95 kB
        2. hs_err_pid239040.log
          84 kB
        3. jenkins.log
          266 kB
        4. Jenkinsfile
          7 kB
        5. Jenkinsfile.txt
          6 kB
        6. stuff.tgz
          310 kB
        7. workflow-durable-task-step.hpi
          85 kB

            Activity

            ncrmnt Andrew a added a comment - - edited

            I've just run a few regressions with that insanely huge timeout, and the bad news is that the problem didn't completely go away. Moreover, two different problems have emerged (I'm not sure whether they are directly related to this issue or whether I should open a new ticket; posting everything here for now).

            First one:
            I'm now seeing the pipeline freeze AFTER all the tasks under the parallel statement have completed. Restarting Jenkins causes some of the steps under parallel to be rerun with the following warning:

            Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
            

            But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. They kept running after the pipeline had completed, perhaps due to a master node restart, and thus prevented further builds in that workspace. I'm not yet sure how to debug this one.

             

            Second one:

            In an attempt to mitigate another issue (an old ctest on RHEL that does not always handle timeouts correctly), I added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:

             Cancelling nested steps due to timeout
             Sending interrupt signal to process
             Cancelling nested steps due to timeout
             After 10s process did not stop
             java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
             at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
             at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
             at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
             at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
             at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
             at java.nio.file.Files.deleteIfExists(Unknown Source)
             at hudson.Util.tryOnceDeleteFile(Util.java:316)
             at hudson.Util.deleteFile(Util.java:272)
             Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
             at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
             at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
             at hudson.remoting.Channel.call(Channel.java:955)
             at hudson.FilePath.act(FilePath.java:1070)
             at hudson.FilePath.act(FilePath.java:1059)
             at hudson.FilePath.deleteRecursive(FilePath.java:1266)
             at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
             at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
             at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
             at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
             at java.util.concurrent.FutureTask.run(FutureTask.java:266)
             at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
             at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
             at java.lang.Thread.run(Thread.java:748)
             Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
             at hudson.Util.deleteFile(Util.java:277)
             at hudson.FilePath.deleteRecursive(FilePath.java:1303)
             at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
             at hudson.FilePath.deleteRecursive(FilePath.java:1302)
             at hudson.FilePath.access$1600(FilePath.java:211)
             at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
             at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
             at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
             at hudson.remoting.UserRequest.perform(UserRequest.java:212)
             at hudson.remoting.UserRequest.perform(UserRequest.java:54)
             at hudson.remoting.Request$2.run(Request.java:369)
             at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
             at java.util.concurrent.FutureTask.run(Unknown Source)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
             at java.lang.Thread.run(Unknown Source)
             Sending interrupt signal to process
             After 10s process did not stop
             java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
             at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
             at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
             at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
             at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
             at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
             at java.nio.file.Files.deleteIfExists(Files.java:1165)
             at hudson.Util.tryOnceDeleteFile(Util.java:316)
             at hudson.Util.deleteFile(Util.java:272)
             Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
             at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
             at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
             at hudson.remoting.Channel.call(Channel.java:955)
             at hudson.FilePath.act(FilePath.java:1070)
             at hudson.FilePath.act(FilePath.java:1059)
             at hudson.FilePath.deleteRecursive(FilePath.java:1266)
             at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
             at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
             at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
             at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
             at java.util.concurrent.FutureTask.run(FutureTask.java:266)
             at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
             at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
             Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
             at hudson.Util.deleteFile(Util.java:277)
             at hudson.FilePath.deleteRecursive(FilePath.java:1303)
             at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
             at hudson.FilePath.deleteRecursive(FilePath.java:1302)
             at hudson.FilePath.access$1600(FilePath.java:211)
             at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
             at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
             at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
             at hudson.remoting.UserRequest.perform(UserRequest.java:212)
             at hudson.remoting.UserRequest.perform(UserRequest.java:54)
             at hudson.remoting.Request$2.run(Request.java:369)
             at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
             at java.util.concurrent.FutureTask.run(FutureTask.java:266)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
             at java.lang.Thread.run(Thread.java:748)
             [Pipeline] }
             [Pipeline] }
             [Pipeline] // timeout
             [Pipeline] // timeout
             [Pipeline] echo
             EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
             [Pipeline] echo
             CTEST BUG: Ctest didn't honor timeout setting?
             [Pipeline] }
             [Pipeline] echo
             EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
             [Pipeline] echo
             CTEST BUG: Ctest didn't honor timeout setting?
             [Pipeline] }
             [Pipeline] // dir
             [Pipeline] // dir
             [Pipeline] }
             [Pipeline] }
             [Pipeline] // node
             [Pipeline] // node
             [Pipeline] }
             [Pipeline] }
             sh: line 1: 104849 Terminated sleep 3
             sh: line 1: 163732 Terminated { while [ ( -d /proc/$pid -o ! -d /proc/$$ ) -a -d '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03' -a ! -f '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-result.txt' ]; do
             touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
             done; }
             sh: line 1: 163733 Terminated JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
             1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed 20250.70 sec
            
            
            

            It looks like killing the simulation takes Jenkins far more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a long time). I'll try to patch this timeout as well and see how it goes.
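The "ask politely, wait, then give up" pattern described above can be sketched in plain Java. This is only an illustration, not the plugin's code: the class name GracefulStop, the method stop, and the use of a sleep child process are hypothetical, and Process.destroy() is assumed to deliver a normal termination request (as it does on typical Unix JVMs). A process that spends a long time writing logs or core dumps after the signal needs a correspondingly long grace period.

```java
import java.util.concurrent.TimeUnit;

public class GracefulStop {
    /**
     * Requests termination, waits up to graceSeconds for the process to exit,
     * and only then kills it forcibly.
     *
     * @return true if the process exited within the grace period
     */
    public static boolean stop(Process process, long graceSeconds) throws InterruptedException {
        process.destroy(); // polite termination request
        if (process.waitFor(graceSeconds, TimeUnit.SECONDS)) {
            return true; // exited on its own within the grace period
        }
        process.destroyForcibly(); // grace period elapsed: forcible kill
        process.waitFor();         // wait until the process is really gone
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a long-running simulation (Unix only).
        Process p = new ProcessBuilder("sleep", "30").start();
        boolean clean = stop(p, 60);
        System.out.println(clean ? "exited within grace period" : "had to be killed forcibly");
    }
}
```

With a 10-second grace period, a simulator that needs a minute to dump its state would always take the forcible branch; raising the period to 60 seconds gives it a chance to exit cleanly.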

            P.S. I've just updated Jenkins and all plugins, took workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.

            diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
            index 9b449d7..b338690 100644
            --- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
            +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
            @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                             }
                         }
                         boolean directory;
            -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
            +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                             directory = ws.isDirectory();
                         } catch (Exception x) {
                             getWorkspaceProblem(x);
            @@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
                                     stopTask = null;
                                     if (recurrencePeriod > 0) {
                                         recurrencePeriod = 0;
            -                            listener().getLogger().println("After 10s process did not stop");
            +                            listener().getLogger().println("After 60s process did not stop");
                                         getContext().onFailure(cause);
                                         try {
                                             FilePath workspace = getWorkspace();
            @@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
                                         }
                                     }
                                 }
            -                }, 10, TimeUnit.SECONDS);
            +                }, 60, TimeUnit.SECONDS);
                             controller.stop(workspace, launcher());
                         } else {
                             listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");
            @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                             return; // slave not yet ready, wait for another day
                         }
                         TaskListener listener = listener();
            -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
            +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                             if (watching) {
                                 Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                                 if (exitCode == null) {
            
            
            dnusbaum Devin Nusbaum added a comment -

            Andrew a did those timeouts end up helping? If so, I can roll them up into https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 and release that so they can be configured without needing to run custom code.

            ncrmnt Andrew a added a comment -

            Devin Nusbaum Sorry for not reporting earlier. The 60-second timeouts seem to have fixed all issues for me. The rest of the problems were due to ctest (and our NUMA scheduler, which is wrapped within it ahead of the actual simulator) not dying correctly when Jenkins asked them to.

            dnusbaum Devin Nusbaum added a comment -

            Andrew a No problem! I will move forward with my PR (adding an additional timeout), thanks so much for interactively debugging the issue!

            dnusbaum Devin Nusbaum added a comment -

            As of version 2.29 of the Pipeline Nodes and Process Plugin, the default timeout for remote calls is 20 seconds, and the value can be configured using the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT.

            I am marking this ticket as closed since that is the main cause of the issue identified from discussion in the comments (thanks Andrew a!). If this issue is still occurring frequently for someone after increasing that value, please comment and we can investigate further.
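As a rough illustration of how a system property with a default like this is commonly read in Java, here is a minimal sketch. It is not taken from the plugin source: the class name RemoteTimeoutProperty and the method remoteTimeoutSeconds are hypothetical, and the assumption that the value is expressed in seconds comes only from the comment above. The property itself would normally be supplied as a -D<property>=<value> argument on the command line of the JVM that runs Jenkins.

```java
public class RemoteTimeoutProperty {
    // Property name as quoted in the comment above.
    static final String PROPERTY =
        "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT";

    // Default stated in the comment above; the unit (seconds) is an assumption.
    static final long DEFAULT_SECONDS = 20;

    /** Returns the configured timeout, or the default if the property is unset or not a number. */
    static long remoteTimeoutSeconds() {
        return Long.getLong(PROPERTY, DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("Remote call timeout: " + remoteTimeoutSeconds() + "s");
    }
}
```

Running this class with -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT=60 would report 60 instead of the default, which mirrors the 60-second value that worked in the patched build discussed earlier in this thread.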


              People

              • Assignee:
                dnusbaum Devin Nusbaum
                Reporter:
                totoroliu Rick Liu
              • Votes:
                19
                Watchers:
                27
