Jenkins · JENKINS-49707

Auto retry for elastic agents after channel closure

    Details

      Description

      While my pipeline was running, the node that was executing logic terminated. I see this at the bottom of my console output:

      Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
      

      There's a spinning arrow below it.

      I have a cron script that uses the Jenkins master CLI to remove nodes which have stopped responding. When I examine this node's page on my Jenkins site, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

      I'm wondering: what would be a better way to achieve "if the channel closes down, retry the work on another node with the same label"?

      Things seem stuck. Please advise.
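      For illustration, the kind of retry asked about above can be approximated today in a scripted Pipeline. This is only a sketch (the label 'linux' and doWork() are placeholders, not taken from the actual job), and it re-runs the whole node block from scratch rather than resuming it:

```groovy
// Sketch only: 'linux' and doWork() are placeholders; untested against a live controller.
retry(3) {
    node('linux') {
        // If the agent channel closes mid-step, steps in here should throw
        // (e.g. ChannelClosedException wrapped in an IOException), and
        // retry{} will request a fresh executor matching the label, which
        // may land on a different node.
        doWork()
    }
}
```

      The catch, as this report describes, is that some steps hang instead of throwing when the channel dies, so this only helps once the failure actually surfaces.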

        Attachments

        1. grub.remoting.logs.zip
          3 kB
        2. grubSystemInformation.html
          67 kB
        3. image-2018-02-22-17-27-31-541.png
          56 kB
        4. image-2018-02-22-17-28-03-053.png
          30 kB
        5. JavaMelodyGrubHeapDump_4_07_18.pdf
          220 kB
        6. JavaMelodyNodeGrubThreads_4_07_18.pdf
          9 kB
        7. jenkins_agent_devbuild9_remoting_logs.zip
          4 kB
        8. jenkins_Agent_devbuild9_System_Information.html
          66 kB
        9. jenkins_agents_Thread_dump.html
          172 kB
        10. jenkins_support_2018-06-29_01.14.18.zip
          1.26 MB
        11. jenkins.log
          984 kB
        12. jobConsoleOutput.txt
          12 kB
        13. jobConsoleOutput.txt
          12 kB
        14. MonitoringJavaelodyOnNodes.html
          44 kB
        15. NetworkAndMachineStats.png
          224 kB
        16. slaveLogInMaster.grub.zip
          8 kB
        17. support_2018-07-04_07.35.22.zip
          956 kB
        18. threadDump.txt
          98 kB
        19. Thread dump [Jenkins].html
          219 kB

          Issue Links

            Activity

            oleg_nenashev Oleg Nenashev added a comment -

            Not sure what the request here is. You are getting a system message from Remoting; in general it is not related to Pipeline or jobs at all. If you want to implement retry features or document the suggestions, IMHO that belongs on the Pipeline side.
            piratejohnny Jon B added a comment -

            Oleg Nenashev I'm not sure how to handle this situation. The problem I need to overcome is that my pipeline hangs with the error message I screenshotted above. I would much prefer that it error out and fail. Unfortunately, the pipeline keeps running indefinitely.

            Can this instead be configured to throw a catchable exception?
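            One workaround sketch for the question above (the two-hour bound, the label, and doWork() are placeholders; untested) is to bound the work so that even a silent hang eventually becomes a catchable exception via timeout:

```groovy
// Sketch: convert a hung channel into something catchable.
// The 2-hour bound, 'linux', and doWork() are placeholders.
try {
    timeout(time: 2, unit: 'HOURS') {
        node('linux') {
            doWork()
        }
    }
} catch (e) {
    // Either the channel-closed IOException or the timeout's
    // FlowInterruptedException ends up here instead of hanging forever.
    echo "Agent lost or stage timed out: ${e}"
    throw e
}
```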
            piratejohnny Jon B added a comment -

            Oleg Nenashev Should this be redesignated as a remoting bug? I'm not sure how to unblock my pipelines that are hanging from this issue.
            piratejohnny Jon B added a comment -

            I just changed the JIRA "component" field for this issue to "remoting".
            oleg_nenashev Oleg Nenashev added a comment -

            Please provide the following info:

            • Support bundle for the timeframe of the outage: https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin
            • Agent thread dump for the timeframe of the outage
            • Agent filesystem log for the time of the outage

            You can find some pointers here: https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues?slide=51
            iamshital Shital Savekar added a comment -

            I would like to increase the priority of this issue to "Major" since it is affecting a lot of users.
            slaughter550 Alex Slaughter added a comment -

            We have also been greatly affected by this issue. A resolution would be very nice.
            edupo Eduardo Lezcano added a comment -

            We are receiving this message sporadically on cloud nodes in Azure managed by Jenkins.
            fnaum Federico Naum added a comment -

            Hi,

            We are losing a team to TeamCity, mostly because of this remoting issue.

            Here are the logs requested; the disconnection happened at 10:21 am (agent devbuild9):

            jenkins_agent_devbuild9_remoting_logs.zip

            jenkins_agents_Thread_dump.html

            jenkins_Agent_devbuild9_System_Information.html

            jenkins_support_2018-06-29_01.14.18.zip

            I would appreciate any pointer to where I can start looking for more information. Let me know if you need more logs.

            This is happening several times daily, so I can provide more logs if needed.
            oleg_nenashev Oleg Nenashev added a comment -

            At least we have some data for diagnostics now.
            fnaum Federico Naum added a comment - edited

            This is a fresher occurrence, with fewer things going on; this time the agent that got disconnected is called grub.

            The job console output (jobConsoleOutput.txt) shows, at 17:27:54:
             
             

            hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on grub failed. The channel is closing down or has closed down  
              at hudson.remoting.Channel.call(Channel.java:948) 
              at hudson.FilePath.act(FilePath.java:1089) 
              at hudson.FilePath.act(FilePath.java:1078)  
               .....  
            17:27:55 ERROR: Issue with creating launcher for agent grub. The agent has not been fully initialized yet

             
             
            The Jenkins master log at that time (jenkins.log) shows the following lines:
             

            Jul 04, 2018 5:27:54 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
            SEVERE: I/O error in channel grub
            java.io.IOException: Unexpected termination of the channel
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
            Caused by: java.io.EOFException
            at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2328)
            at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
            at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
                    ... [trimmed stacktrace]
            
            Jul 04, 2018 5:27:55 PM hudson.model.Slave reportLauncherCreateError
            WARNING: Issue with creating launcher for agent grub. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
            java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
            at hudson.model.Slave.reportLauncherCreateError(Slave.java:524)
            at hudson.model.Slave.createLauncher(Slave.java:496)
                    ... [trimmed stacktrace]
             
            Jul 04, 2018 5:27:55 PM hudson.model.Slave reportLauncherCreateError
            WARNING: Issue with creating launcher for agent grub. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
            java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
            at hudson.model.Slave.reportLauncherCreateError(Slave.java:524)
            at hudson.model.Slave.createLauncher(Slave.java:496)
                    ... [trimmed stacktrace]
             
            Jul 04, 2018 5:27:55 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol
            INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
            Jul 04, 2018 5:27:55 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
            INFO: rndtest_vortexLibrary/master #289 completed: ABORTED
              

             
            The agent remoting log that shows the error is the file created at 5:08 pm (remoting.log.2 inside grub.remoting.logs.zip):
             
             

            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
            Caused by: java.io.EOFException
            at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2675)
            at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3150)
            at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:859)
            at java.io.ObjectInputStream.<init>(ObjectInputStream.java:355)
            at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            
             

             


             
            But the message does not include a timestamp; it would be handy to have one, because I cannot work out whether the agent or the Jenkins master initiated the disconnection.
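            If the agent JVM can be started with extra system properties (an assumption about how the agent is launched), java.util.logging, which remoting logs through, can prepend a timestamp to every record via the standard SimpleFormatter format property. A sketch of such a launch configuration, not official project advice:

```properties
# Sketch: set on the agent JVM, e.g.
#   java -Djava.util.logging.SimpleFormatter.format=... -jar remoting.jar ...
# %1$t* = record timestamp, %4$s = level, %2$s = source, %5$s/%6$s = message/throwable
java.util.logging.SimpleFormatter.format=%1$tF %1$tT %4$s %2$s: %5$s%6$s%n
```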
             
            I've also included:

            • The full support log (support_2018-07-04_07.35.22.zip)
            • The logs under ${JENKINS_HOME}/logs/slaves/grub (slaveLogInMaster.grub.zip)
            • Agent system information that I grabbed just minutes after seeing the disconnection:
                  - System Information (grubSystemInformation.html)
                  - Heap dump (JavaMelodyGrubHeapDump_4_07_18.pdf)
                  - Threads (JavaMelodyNodeGrubThreads_4_07_18.pdf)
                  - (MonitoringJavaelodyOnNodes.html)
            • A screenshot (NetworkAndMachineStats.png) of the stats of the master (jenkinssecure1) and the agent (grub) showing network activity, memory, and CPU history; hardly anything going on on either machine.
            fnaum Federico Naum added a comment -

            Here are the files: jobConsoleOutput.txt, grub.remoting.logs.zip, JavaMelodyGrubHeapDump_4_07_18.pdf, JavaMelodyNodeGrubThreads_4_07_18.pdf, MonitoringJavaelodyOnNodes.html, grubSystemInformation.html, Thread dump [Jenkins].html, support_2018-07-04_07.35.22.zip, slaveLogInMaster.grub.zip, jenkins.log
            tom_ghyselinck Tom Ghyselinck added a comment -

            Hi Oleg Nenashev,

            Do you have any update on this?

            We have seen similar issues: the Jenkins Pipeline hangs when the node becomes unreachable at some point.

            It would be great to see this fixed. This issue sometimes blocks many jobs in our CI queue.

            In this case it was an intermittent networking issue:

            19:35:55 [Sat Jul 14 17:35:54 2018] Waiting for impl_1 to finish...
            19:37:32 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4860 Killed                  "$RDI_PROG" "$@"
            19:37:32 Makefile:423: recipe for target '../../work/projects/dev1/dev1.sdk/dev1.hdf' failed
            19:37:32 make: *** [../../work/projects/dev1/dev1.sdk/dev1.hdf] Error 137
            19:37:32 make: *** Waiting for unfinished jobs....
            19:38:06 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4859 Killed                  "$RDI_PROG" "$@"
            19:38:06 Makefile:423: recipe for target '../../work/projects/dev0/dev0.sdk/dev0.hdf' failed
            19:38:06 make: *** [../../work/projects/dev0/dev0.sdk/dev0.hdf] Error 137
            19:39:59 Cannot contact ubuntu-16-04-amd64-2: java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA/build/fpga-projects/build at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
            19:40:30 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4861 Killed                  "$RDI_PROG" "$@"
            19:40:30 Makefile:423: recipe for target '../../work/projects/dev2/dev2.sdk/dev2.hdf' failed
            19:40:30 make: *** [../../work/projects/dev2/dev2.sdk/dev2.hdf] Error 137
            

            Finally, we aborted the build:

            Aborted by me
            09:41:29 Sending interrupt signal to process
            09:41:39 After 10s process did not stop
            

            Please note that in the post steps we see the errors occur, but the build no longer hangs there:

            Error when executing always post condition:
            java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA/packages at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
            	at hudson.FilePath.act(FilePath.java:1043)
            	at hudson.FilePath.act(FilePath.java:1025)
            	at hudson.FilePath.mkdirs(FilePath.java:1213)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:50)
            	at hudson.security.ACL.impersonate(ACL.java:290)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:47)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            Caused by: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
            	at hudson.remoting.Channel.call(Channel.java:948)
            	at hudson.FilePath.act(FilePath.java:1036)
            	... 12 more
            Caused by: java.io.IOException: Unexpected termination of the channel
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
            Caused by: java.io.EOFException
            	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
            	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
            	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
            	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
            	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
            	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            
            [Pipeline] cleanWs
            Error when executing cleanup post condition:
            java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
            	at hudson.FilePath.act(FilePath.java:1043)
            	at hudson.FilePath.act(FilePath.java:1025)
            	at hudson.FilePath.mkdirs(FilePath.java:1213)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:50)
            	at hudson.security.ACL.impersonate(ACL.java:290)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:47)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            Caused by: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
            	at hudson.remoting.Channel.call(Channel.java:948)
            	at hudson.FilePath.act(FilePath.java:1036)
            	... 12 more
            Caused by: java.io.IOException: Unexpected termination of the channel
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
            Caused by: java.io.EOFException
            	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
            	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
            	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
            	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
            	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
            	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            

            I hope this somewhat helps.

            With best regards,
            Tom.
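            The behaviour noted above, post-step errors surfacing instead of hanging, can be leaned on deliberately. A declarative sketch (label and steps are placeholders; assumes a workflow-basic-steps version whose catchError accepts a buildResult parameter; untested) that tolerates a dead channel during cleanup:

```groovy
// Sketch: placeholders throughout; tolerate a closed channel in cleanup.
pipeline {
    agent { label 'ubuntu-16-04-amd64' }
    stages {
        stage('Build') {
            steps { sh 'make' }
        }
    }
    post {
        always {
            // cleanWs() may throw ChannelClosedException if the agent is
            // gone; catchError records the failure instead of aborting the
            // remaining post actions.
            catchError(buildResult: 'FAILURE') {
                cleanWs()
            }
        }
    }
}
```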

            oleg_nenashev Oleg Nenashev added a comment -

            Tom Ghyselinck nope, I don't. I requested the info needed to diagnose the issue, but I have never reviewed it. I am unlikely to have time for that in the short term, being busy with other work in the community. Jeff Thompson is the current Remoting default assignee, so I will assign the issue to him.

             

            tom_ghyselinck Tom Ghyselinck added a comment -

            Hi Oleg Nenashev,

            Thanks!

            P.S. I set the Assignee to "Automatic" and it assigned you. It probably needs a change in the component configuration to set it to Jeff Thompson by default?

            With best regards,
            Tom.

            oleg_nenashev Oleg Nenashev added a comment -

            No, I am just the default assignee of the "_unsorted" component, which was first in the component list. The "remoting" component is configured properly, and I have just removed "_unsorted" for now since we have got the diagnostics info.

            fnaum Federico Naum added a comment -

            Has anyone experiencing this issue had a look at this new plugin: https://plugins.jenkins.io/remoting-kafka?

            Oleg Nenashev I can see you are very involved with it.

            It looks promising. It is lacking some documentation, but I'll play with it to see if I can get it working, and report back on whether it solves my connection issues.

            piratejohnny Jon B added a comment -

            The repro case here is pretty simple:

            1) Create a parallel job (even a job that just does a sleep)

            2) Terminate the executor's host while it's running

            It hangs with this error every time.

             

            piratejohnny Jon B added a comment - - edited

            I don't mean to be dramatic, but this is literally the biggest problem in all of Jenkins as far as I can tell. If we lose an EC2 host while an executor is doing parallel work, we badly need the parallel item to restart on another healthy executor. When it just plain hangs, we can't do that, and the user experience of hanging is not acceptable.

            I would recommend elevating the urgency here to the highest level to get this triaged.

            piratejohnny Jon B added a comment - - edited

            Repro code:

            def jobs = [:]
            jobs["Do Work"] = getWork()
            parallel jobs
            println "Parallel run completed."
            
            def getWork() {
              return {
                node('general') {
                  sh """|#!/bin/bash
                        |set -ex
                        |echo "going to sleep..."
                        |sleep 300
                        |echo "yay I made it to the end."
                        |""".stripMargin()
                }
              }
            }
             

            To repro, run this pipeline and, once the control flow hits the sleep, terminate the executor's host; it will hang with something like this:

            [Do Work] Cannot contact ip-172-31-237-68.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-237-68.us-west-2.compute.internal failed. The channel is closing down or has closed down 

            It hangs with this error every time I try it.

            stickycode Michael McCallum added a comment -

            I concur this is a pretty serious issue. I've tried a number of workarounds, like timeouts to restart the job, but once it hangs it's stuck.

            mgreco2k Michael Greco added a comment - - edited

            I've been noticing this for MONTHS. And in case people don't realize, the master branch of the docker-plugin wasn't building today, 9/17/18:

            https://ci.jenkins.io/job/Plugins/job/docker-plugin/ 

            Anyway, this weekend I loaded docker-plugin build 1.1.5, and today on every build I was getting "The channel is closing down or has closed down" as my jobs would still appear to be running even though the container was obviously gone.

             

            I wound up downgrading to an older build I have:

            1.1.5-SNAPSHOT (private-554bbf8a-win2012-6d34b0$)

            in which the problem seems to happen less. I went so far as to rebuild some of my "build containers", as they are created "FROM jenkinsci/slave", and I noticed that image had an update sometime in August.

             

            Again, it made no difference using the released 1.1.5 version of docker-plugin (everything wound up in the state of "The channel is closing down or has closed down"), and that's when I noticed the master branch isn't building either ... so I just went back to my earlier build.

            piratejohnny Jon B added a comment -

            If this is left unfixed for much longer, our organization is going to be forced to use another technology for CI/CD, since this is causing widespread pain and lost confidence in Jenkins among the hundreds of developers who are using it at our company.

            mgreco2k Michael Greco added a comment - - edited

            Maybe try the LTS? ... uggg, I try to be a "start-up" kind of guy ... it sounds like there are maybe some integration tests that need to be part of the project ... If you have access to spinning up another VM, maybe launch the LTS version and try it out. I keep the Jenkins data in a docker volume, so moving around between these versions to try stuff out on different docker hosts is helpful for exactly these situations. I'm running 2.140, but maybe the plugin works better with the LTS? (OK, I'm reaching outside the box, because if the plugin has a bug then...)

            piratejohnny Jon B added a comment - - edited

            I just met with Jesse Glick who told me that in my case, the underlying mechanism that triggers when I call for "node('general')" selects one of my spot instances. At that moment, an internal ec2 hostname is selected. If, while the work is being performed, that particular node dies, Jenkins intentionally waits for another machine at that hostname to wake up before it will continue. It is for this reason why it appears to hang forever - because my AWS spot/autoscaling does not launch another machine with the same internal hostname.

            He suggested setting a timeout block which would retry the test run if the work does not complete within a given period.

            We both agreed this seems to therefore be a new feature request.

            The new feature would allow Jenkins to re-dispatch the closure of work to any other node that matches the given label if the original executor's host was terminated while the work was being performed.
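
            In the meantime, the timeout-plus-retry workaround could be sketched roughly like this (a sketch only: 'general' is the label from my repro, and the 30-minute budget and 3 attempts are arbitrary examples):

            def jobs = [:]
            jobs["Do Work"] = {
              retry(3) {                                // up to 3 attempts
                timeout(time: 30, unit: 'MINUTES') {    // bound each attempt
                  node('general') {                     // fresh node per attempt
                    sh 'sleep 300'
                  }
                }
              }
            }
            parallel jobs

            The idea is that if the agent dies, the timeout eventually aborts that attempt and retry() re-dispatches the whole node block to any other node matching the label. This is a mitigation rather than a fix, and it may not help in every case where the node step itself is blocked.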

            stickycode Michael McCallum added a comment - - edited

            I tried using a timeout block, but it never triggers. Does anyone have an example of that working?

            That's 2.141 running Jenkins on k8s with k8s agents, with the latest plugins as of a few days ago.

            mgreco2k Michael Greco added a comment - - edited

            In my case the original node doesn't die ... I'm not using AWS autoscaling ...

            fnaum Federico Naum added a comment -

            Agree with Jon B that this is a critical issue. We have one of our teams switching to TeamCity. In the meantime, I'm trying to attack the problem using the new kafka agent plugin. In my tests it seems quite stable, and I'm not encountering the frequent channel disconnections when running parallel jobs, so I will be deploying it to production this week.

            I agree as well that retrying on a new node that satisfies the labels can be a different issue, but I would also say that it should be top priority.

             

            PS: We are also not using AWS

            mgreco2k Michael Greco added a comment - - edited

            This is all fine and well, and not to complain, but why is the connection going away? I'll blame myself first (that's experience) and say I'm sure I didn't read something ... or maybe missed something that was said in this report?

             

            It just feels like the issue of this report got changed from "connection closed" to "Auto retry for elastic agents after channel closure", but I'm not seeing my AWS instance die as Jon B is. Can someone enlighten me, please?

             

            Or maybe this is really just some issue where the docker plugin isn't able to reach the container anymore ... and the bug is in the retry logic? ... why is the channel prematurely going down in the first place? The "closed channel" message does seem to happen during longer-running requests.

            piratejohnny Jon B added a comment -

            Michael Greco I think the issue leading me to this error message is a different set of circumstances. I'm not using the docker plugin for example. You might want to open a new ticket.

            My case is 100% based on how Jenkins is meant to work - it's trying to wait for the node that disconnected to come back up. However, in the case of cloud elastic computing, the worker will never come back up, and that's why I see the hang. It is for this reason the title was adjusted, and also how the ticket is filed.

            mgreco2k Michael Greco added a comment -

            Uggg .. ok ... thanks.

            Mike

            dubrsl Viacheslav Dubrovskyi added a comment -

            Federico Naum How does kafka plugin behave in the event of a node shutdown?

            fnaum Federico Naum added a comment -

            There is an issue where the Jenkins master does not reflect a kafka agent disconnection (I have logged this as https://issues.jenkins-ci.org/browse/JENKINS-54001).

            • If I reboot an agent and then trigger a build asking for that agent, Jenkins keeps waiting, and when the agent comes back online it runs the job to completion *.
            • If the agent does not come back online, it will eventually time out at some point, fail the build, and mark the agent as offline.
            • If I reboot an agent or stop the remoting process while it is running a job on that agent, Jenkins keeps waiting for the agent or the process to come back online, after printing this line:
              Cannot contact AGENTNAME: java.lang.InterruptedException
              • When it gets back online, it then fails with
                wrapper script does not seem to be touching the log file in /var/kafka/jenkins/workspace/demo@tmp/durable-ec4fef48
                (JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
                I thought I had logged https://issues.jenkins-ci.org/browse/JENKINS-53397 for this issue, but it is a bit different.

             

            Even this situation is not ideal. The kafka agents are much more reliable, and I do not get the ChannelClosedException when running parallel builds. So for me it is more stable, even if the recovery from an agent shutdown is not ideal.

             

            Note: these tests are with kafka plugin 1.1.1 (1.1.3 is out, so I will re-do this test once I upgrade to that latest version).

             

            • I wrote this systemd unit for my CentOS 7 setup, so the agent reconnects when the machine is rebooted or the process dies for some reason:

             

            [Unit]
            Description=Jenkins kafka agent
            After=network.target
            
            [Service]
            Type=simple
            Restart=always
            RestartSec=1
            User=buildboy
            Environment=PATH=/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/bin/X11:/sbin:/usr/local/sbin
            
            ExecStart=/usr/bin/java -jar /var/kafka/remoting-kafka-agent.jar -name AGENTNAME -master http://myjenkinsinstance:8081/ -secret 611c91c8013e27b8b00e36d66e421a1743604230862f4d290a87b9426a2b3b1f -kafkaURL kafka:9092 -noauth
            
            [Install]
            WantedBy=multi-user.target
            

             

             

             

             

            dubrsl Viacheslav Dubrovskyi added a comment -

            Federico Naum, thank you for the information. It does not solve the main problem, though. I checked ssh and swarm agents, and this problem does not depend on the agent's connection type.

            Jon B's explanation, that recreating the node with the same hostname and label unblocks the build, helped me. I use GCE for nodes and a custom script to add or remove them, so I can easily add logic to detect removed nodes and re-add them.
            It's a pity that none of the cloud plugins can do this.

            amirbarkal Amir Barkal added a comment - - edited

            The problem is Jenkins not aborting / cancelling / stopping the build when the agent is terminated in the middle of a build.
            There's an infinite hang that's easy to reproduce:

            1. Start Jenkins slave with remoting jnlp jar:

            java -jar agent.jar -jnlpUrl "http://jenkins:8080/computer/agent1/slave-agent.jnlp" -secret 123
            

            2. Run the following pipeline:

            node('agent1') {
                sh('sleep 100000000')   
            }
            

            3. Kill the agent (Ctrl+C)

            4. Jenkins output in job console log:

            Started by user admin
            Replayed #23
            Running as admin
            Running in Durability level: MAX_SURVIVABILITY
            [Pipeline] node
            Running on agent1-805fa9fd in /workspace/Pipeline-1
            [Pipeline] {
            [Pipeline] sh
            [Pipeline-1] Running shell script
            + sleep 100000000
            Cannot contact agent-805fa9fd: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 2163fbb04240.jenkins/172.20.0.3:43902 failed. The channel is closing down or has closed down
            

            threadDump.txt

            System info:
            Jenkins ver. 2.138
            Durable Task: 1.25

            What I would like is a way to configure a maximum timeout for the Jenkins master to wait for the agent to respond, and then just abort the build. It's absolutely unacceptable that builds will hang due to dead agents.
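
            Until such a setting exists, one rough approximation (a sketch under the assumption that the timeout step's activity option behaves as documented; label and durations are examples) is to abort when the step produces no log output for a given period:

            // Sketch: abort if the shell step is silent for 15 minutes, e.g.
            // because the agent died. This aborts instead of hanging, but may
            // not help while the node step itself is blocked waiting for the
            // agent to return.
            node('agent1') {
                timeout(time: 15, unit: 'MINUTES', activity: true) {
                    sh 'sleep 100000000'
                }
            }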

            jrogers Jonathan Rogers added a comment -

            Like Amir Barkal, I would like a pipeline step to fail quickly if the Jenkins master loses its connection to the agent for the node running the step. The log mentions that hudson.remoting.ChannelClosedException was thrown. If I can catch that exception in my pipeline script, I can retry the appropriate steps.
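
            A sketch of that catch-and-retry pattern (hypothetical: the 'general' label and 'make test' step are placeholders; the class-name check is one way to avoid whitelisting hudson.remoting types in the script security sandbox, and it only helps when the exception actually propagates instead of the step hanging):

            // Sketch: rethrow inside retry() so a channel loss re-runs the
            // whole node block on another agent with the same label.
            retry(3) {
                node('general') {            // placeholder label
                    try {
                        sh 'make test'       // placeholder build step
                    } catch (e) {
                        if (e.getClass().getName().contains('ChannelClosedException')) {
                            echo 'Lost the agent channel; retrying on another node'
                        }
                        throw e              // let retry() dispatch a new attempt
                    }
                }
            }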

            jspiewak Joshua Spiewak added a comment -

            FWIW, we use the EC2 Fleet Plugin and regularly experience this issue.

            It would be great if agents had an attribute to indicate whether they are durable/long lived or dynamic/transient. That way the channel closure could be handled appropriately for each scenario. At the very least, having a global config to control whether or not agent disconnection was fatal to a build or not would allow pipeline authors to handle the disconnection explicitly, without resorting to putting timeouts in place.

            kabylake Troni Dale Atillo added a comment - - edited

            I have this problem too. Our script has to trigger a reboot of the slave machine, and we added a sleep to wait for the slave to come back. Once the slave comes back in the middle of the executing node block and our pipeline continues execution, we get this:

            hudson.remoting.ChannelClosedException: Channel "unknown": .... The channel is closing down or has closed down
             

            I noticed that when the agent was disconnected, the workspace we were using before the disconnection seems locked when it comes back. Any operation that requires execution in that workspace seems to cause this error; it seems the workspace cannot be used anymore. My script was run in parallel too.

            The workaround I tried was to run the next execution, or next line of script, in a different workspace, and it works:

            ws (...){ 
            //other scripts need to be executed after the disconnection 
            }

             

            Show
            kabylake Troni Dale Atillo added a comment - - edited I have this problem too. Our script will have to trigger a reboot of the slave machine and we added sleep just wait for the slave to come back. Once the slave comes back in the middle of the executing node, then our pipeline continue the execution, we got this  hudson.remoting.ChannelClosedException: Channel "unknown" : .... The channel is closing down or has closed down I noticed that when the agent was disconnected, the workspace that we are using before the disconnection seems locked when it comes back. Any operation you will do that requires execution in the said workspace seems cause this error. It seems it cannot use that workspace anymore. My script was run in parallel too. The workaround that I tried was to run the next execution or next line of script into a different wokspace and it works. ws (...){ //other scripts need to be executed after the disconnection }  
            jglick Jesse Glick added a comment - - edited

            There are actually several subcases mixed together here.

            1. The originally reported RFE: if something like a spot instance is terminated, we would like to retry the whole node block.
            2. If an agent gets disconnected but continues to be registered in Jenkins, we would like to eventually abort the build. (Not immediately, since sometimes there is just a transient Remoting channel outage or agent JVM crash or whatever; if the agent successfully reconnects, we want to continue processing output from the durable task, which should not have been affected by the outage.)
            3. If an agent goes offline and is removed from the Jenkins configuration, we may as well immediately abort the build, since it is unlikely it would be reattached under the same name with the same processes still running. (Though this can happen when using the Swarm plugin.)
            4. If an agent is removed from the Jenkins configuration and Jenkins is restarted, we may as well abort the build, as in #3.

            #4 was addressed by JENKINS-36013. I filed workflow-durable-task-step #104 for #3. For this to be effective, cloud provider plugins need to actually remove dead agents automatically (at some point); it will take some work to see if this is so, and if not, whether that can be safely changed.

            #2 is possible but a little trickier, since some sort of timeout value needs to be defined.

            #1 would be a rather different implementation and would certainly need to be opt-in (somehow TBD).
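
            Pending a built-in timeout for #2, a user-level stopgap is to bound the node body with an explicit timeout step (a sketch; the label and the 2-hour value are arbitrary choices, not a recommendation):

            ```groovy
            // Sketch: bound the whole node body so a dead agent channel
            // cannot hang the build forever. Label and duration are arbitrary.
            timeout(time: 2, unit: 'HOURS') {
                node('some-label') {
                    sh './build.sh'
                }
            }
            ```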

            terma Artem Stasiuk added a comment -

            For the first one, could we use something like:

            @Override
            public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
                super.taskCompleted(executor, task, durationMS);
                if (isOffline() && getOfflineCause() != null) {
                    System.out.println("Opa, try to resubmit");
                    Queue.getInstance().schedule(task, 10);
                }
            }
            
            orgoz Olivier Boudet added a comment -

            This issue appears in the release notes of the kubernetes plugin 1.17.0, so I assume it should be fixed?

            I upgraded to 1.17.1 but I still encounter it.

            My job has been blocked for more than an hour on this error:

            Cannot contact openjdk8-slave-5vff7: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.8.4.28/10.8.4.28:35920 failed. The channel is closing down or has closed down 
            

            The slave pod has been evicted by k8s:

            $ kubectl -n tools describe pods openjdk8-slave-5vff7
            ....
            Normal Started 57m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Started container
            Warning Evicted 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 The node was low on resource: memory. Container jnlp was using 4943792Ki, which exceeds its request of 0.
            Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://openjdk:Need to kill Pod
            Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://jnlp:Need to kill Pod
            

            jglick Jesse Glick added a comment -

            Olivier Boudet subcase #3 as above should be addressed in recent releases: if an agent pod is deleted then the corresponding build should abort in a few minutes. There is not currently any logic which would do the same after a PodPhase: Failed. That would be a new RFE.

            piratejohnny Jon B added a comment -

            Jesse Glick Just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior a major step in the right direction for Jenkins. Here's what I noticed:

            Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
            Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
            The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):
            pipeline {
                agent { label 'universal' }
            ...
            This particular declarative pipeline tries to "sh" to the console at the end inside a post{} section and clean up after itself, but since the node was lost, the next error that also appears in the Jenkins console log is:
            org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
            This error was the result of the following code:
            post {
                always {
                    sh """|#!/bin/bash
                          |set -x
                          |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
                          """.stripMargin()
            ...
            Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

            Now if there's any way to get this to actually retry the step it was on, such that the pipeline can tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node is deleted during a scaledown is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of the developers' steps trigger and the pipeline concludes with a high degree of durability.

            Keep up the great work you are all doing. This is great.

            jglick Jesse Glick added a comment -

            The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

            if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

            Well that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library like

            while (true) {
              try {
                node('spotty') {
                  sh '…'
                }
                break
              } catch (x) {
                if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                    x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                  continue
                } else {
                  throw x
                }
              }
            }
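
            One way this hack could be packaged as a custom step in a trusted shared library (the step name retryOnRemovedNode and the vars/ layout are hypothetical, following the standard shared-library convention):

            ```groovy
            // vars/retryOnRemovedNode.groovy in a trusted shared library
            // (step name and layout are hypothetical). Retries the node body
            // whenever the allocated node is removed mid-build.
            def call(String label, Closure body) {
                while (true) {
                    try {
                        node(label) {
                            body()
                        }
                        return
                    } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException x) {
                        if (!x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                            throw x // some other interruption, e.g. a user abort
                        }
                        // node was removed: loop and retry on a fresh agent
                    }
                }
            }
            ```

            A pipeline would then call it as `retryOnRemovedNode('spotty') { sh '…' }`.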
            
            oxygenxo Andrey Babushkin added a comment -

            We use the kubernetes plugin with our bare-metal kubernetes cluster, and the problem is that a pipeline can run indefinitely if the agent inside the pod is killed or the underlying node is restarted. Is there any option to tweak this behavior, e.g. some timeout setting (other than an explicit timeout step)?

            jglick Jesse Glick added a comment -

            Andrey Babushkin that should have already been fixed—see linked PRs.


              People

              • Assignee: Unassigned
              • Reporter: piratejohnny Jon B
              • Votes: 33
              • Watchers: 47