Jenkins / JENKINS-45036

Pipeline job hangs with remote file operation failed / channel is already closed after master restart

    Details

      Description

      In part, I'm reporting this because I don't know where to begin.

      I found this while working with an existing, somewhat large pipeline script, in which I've only recently tried restarting Jenkins during a pipeline run. Having worked around one issue (which was more obviously my fault), I'm now hitting the following when restarting and resuming, tested at various points during the script:

      15:00:02 [<ParallelStage1>] Cannot contact <LinuxNode>: java.io.IOException: remote file operation failed: <Workspace>/<ParallelStage1> at hudson.remoting.Channel@36509a01:<LinuxNode>: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage2>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage2> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage3>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage3> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      

      The Linux agent in question is launched by SSH on Debian Jessie.
      The Windows agent is Windows Server 2012 R2 running the agent through JNLP.

      I've tried restarting the instance (using the safe restart from the UI) at various points now, and on resume it almost always fails immediately with this error.

      In one instance I managed to catch the exception while running the stash step, seemingly post-resume:

      13:15:59 [<ParallelStage1>] Caught exception: java.nio.channels.ClosedChannelException
      13:15:59 [<ParallelStage1>] Stacktrace: [hudson.remoting.Request.abort(Request.java:307),
      hudson.remoting.Channel.terminate(Channel.java:896),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208),
      org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832),
      org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800),
      org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311),
      hudson.remoting.Channel.close(Channel.java:1295),
      hudson.remoting.Channel.close(Channel.java:1263),
      hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:704),
      hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
      hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:87),
      jenkins.model.Jenkins.access$2000(Jenkins.java:307),
      jenkins.model.Jenkins$22.run(Jenkins.java:3340),
      hudson.model.Queue._withLock(Queue.java:1334),
      hudson.model.Queue.withLock(Queue.java:1211),
      jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3334),
      jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
      hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
      jenkins.model.Jenkins$26.run(Jenkins.java:4196),
      ......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
      hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545),
      hudson.remoting.Request.call(Request.java:172),
      hudson.remoting.Channel.call(Channel.java:829),
      hudson.FilePath.act(FilePath.java:985),
      hudson.FilePath.act(FilePath.java:974),
      hudson.FilePath.archive(FilePath.java:456),
      org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:112),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:100),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49),
      hudson.security.ACL.impersonate(ACL.java:260),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46),
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471),
      java.util.concurrent.FutureTask.run(FutureTask.java:262),
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145),
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615),
      java.lang.Thread.run(Thread.java:745)]
      

      But normally it just seems to fail immediately on resume.

      After this, all the parallel branches hang and have to be killed via the two-stage cancellation: attempting to cancel the job, then clicking the prompt in the console output.

      Most of the job runs one batch or shell script after another, and the failure almost always occurs on returning from one of these.
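      As a possible mitigation for the indefinite hang, wrapping the long-running script steps in a `timeout` should at least fail the branch rather than leave it stuck. This is only a sketch; the node label, stage name, script, and duration below are placeholders, not taken from the actual job:

      ```groovy
      // Sketch: wrap each sh/bat step in a timeout so a dead agent
      // channel aborts the branch instead of hanging until it is
      // manually killed. Label, stage, script, and limit are placeholders.
      node('<LinuxNode>') {
          stage('<ParallelStage1>') {
              timeout(time: 30, unit: 'MINUTES') {
                  sh './build.sh'
              }
          }
      }
      ```

      This does not address the underlying channel closure, but it turns the hang into a branch failure that can be handled or retried.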

      I've been trying to build a test script from scratch that mimics many of the functions of the failing script in order to find a reproduction to report here, but I haven't yet come close to making it fail.
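      For reference, the rough shape of the failing job as described above might be sketched like this (node labels, branch names, scripts, and stash patterns are hypothetical placeholders, not the actual pipeline):

      ```groovy
      // Hypothetical skeleton of the failing job: parallel branches on a
      // Linux (SSH) agent and a Windows (JNLP) agent, each running a
      // script and stashing results. Restarting the master mid-run is
      // what triggers the "channel is already closed" failure on resume.
      parallel(
          '<ParallelStage1>': {
              node('<LinuxNode>') {
                  sh './long-running-build.sh'   // failure typically occurs on return
                  stash name: 'stage1', includes: '**/output/**'
              }
          },
          '<ParallelStage2>': {
              node('<WindowsNode>') {
                  bat 'long-running-build.cmd'
                  stash name: 'stage2', includes: '**/output/**'
              }
          }
      )
      ```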

      I am also using a shared library with a mixture of CPS and non-CPS code across shared functions and classes. However, I normally get no serialisation warnings during pipeline execution or in the Jenkins master log, and no errors other than those shown above when the job fails, so I'm not sure what to look at.

        Attachments

          Activity

          jglick Jesse Glick added a comment -

          Some sort of problem with your agent connection. Unless it is reproducible from scratch, there is not much else to say. Unfortunately Remoting offers poor diagnostic capabilities currently.
          philmcardlecg Phil McArdle added a comment -

          For the record, I haven't been able to reproduce this, so if you don't need the bug for any other reason it can be closed.
          jglick Jesse Glick added a comment -

          Unfortunately there is probably nothing to be done here until diagnostics have been improved.
          shawnzhesun shawn sun added a comment -

          Jesse Glick, we are also experiencing the same problem with the latest Jenkins and plugin versions.

          The job will hang there forever, until a timeout, if the Jenkins master is unable to contact any of the workers.
          jglick Jesse Glick added a comment -

          It is possible the fix of JENKINS-36013 improved the situation. Without steps to reproduce from scratch, that is just a guess.

            People

            • Assignee: Unassigned
            • Reporter: philmcardlecg Phil McArdle
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
              • Updated:
              • Resolved: