Jenkins / JENKINS-49707

Auto retry for elastic agents after channel closure


      Description

      While my pipeline was running, the node that was executing logic terminated. I see this at the bottom of my console output:

      Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
      

      There's a spinning arrow below it.

      I have a cron script that uses the Jenkins master CLI to remove nodes which have stopped responding. When I examine this node's page in the Jenkins web UI, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

      I'm wondering: what would be a better way to say "If the channel closes down, retry the work on another node with the same label"?

      Things seem stuck. Please advise.
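
      To illustrate the kind of behavior I'm after (just a sketch on my part — the conditions parameter on retry here is hypothetical, not something I know Jenkins to support today), I'd love to be able to write something like:

      retry(count: 3, conditions: [agent()]) {
          // Re-run the whole body on a fresh agent with the same label if the
          // agent's channel closes, but not on ordinary build failures.
          node('universal') {        // 'universal' is just an example label
              sh 'make test'         // steps in here must be safe to re-run
          }
      }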

        Attachments

        1. grub.remoting.logs.zip
          3 kB
        2. grubSystemInformation.html
          67 kB
        3. image-2018-02-22-17-27-31-541.png
          56 kB
        4. image-2018-02-22-17-28-03-053.png
          30 kB
        5. JavaMelodyGrubHeapDump_4_07_18.pdf
          220 kB
        6. JavaMelodyNodeGrubThreads_4_07_18.pdf
          9 kB
        7. jenkins_agent_devbuild9_remoting_logs.zip
          4 kB
        8. jenkins_Agent_devbuild9_System_Information.html
          66 kB
        9. jenkins_agents_Thread_dump.html
          172 kB
        10. jenkins_support_2018-06-29_01.14.18.zip
          1.26 MB
        11. jenkins.log
          984 kB
        12. jobConsoleOutput.txt
          12 kB
        13. jobConsoleOutput.txt
          12 kB
        14. MonitoringJavaelodyOnNodes.html
          44 kB
        15. NetworkAndMachineStats.png
          224 kB
        16. slaveLogInMaster.grub.zip
          8 kB
        17. support_2018-07-04_07.35.22.zip
          956 kB
        18. threadDump.txt
          98 kB
        19. Thread dump [Jenkins].html
          219 kB


            Activity

            terma Artem Stasiuk added a comment -

            For the first one, could we use something like this?

            // Sketch: assumes this lives in a Computer subclass that also implements
            // hudson.model.ExecutorListener, so taskCompleted() fires when an executor
            // finishes (or loses) a task.
            @Override
            public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
                super.taskCompleted(executor, task, durationMS);
                // If the node has gone offline mid-build, push the task back onto the queue.
                if (isOffline() && getOfflineCause() != null) {
                    System.out.println("Opa, try to resubmit");
                    Queue.getInstance().schedule(task, 10); // 10-second quiet period
                }
            }
            
            orgoz Olivier Boudet added a comment -

            This issue appears in the release notes of kubernetes plugin 1.17.0, so I assume it should be fixed?

            I upgraded to 1.17.1 and I still encounter it.

            My job has been blocked for more than an hour on this error:

            Cannot contact openjdk8-slave-5vff7: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.8.4.28/10.8.4.28:35920 failed. The channel is closing down or has closed down 
            

            The slave pod was evicted by k8s:

            $ kubectl -n tools describe pods openjdk8-slave-5vff7
            ....
            Normal Started 57m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Started container
            Warning Evicted 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 The node was low on resource: memory. Container jnlp was using 4943792Ki, which exceeds its request of 0.
            Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://openjdk:Need to kill Pod
            Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://jnlp:Need to kill Pod
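
            (Sketch only, assuming the kubernetes plugin's podTemplate/containerTemplate parameters: the eviction message says the jnlp container "exceeds its request of 0", i.e. it has no memory request at all. Giving the jnlp container an explicit request and limit should make the kubelet far less likely to evict it; the image name and values below are only examples.)

            podTemplate(containers: [
                // Override the default jnlp container with explicit memory settings,
                // so the scheduler and kubelet have a real request to work against.
                containerTemplate(name: 'jnlp',
                                  image: 'jenkins/jnlp-slave',          // example image
                                  resourceRequestMemory: '1Gi',
                                  resourceLimitMemory: '2Gi')
            ]) {
                node(POD_LABEL) {       // POD_LABEL assumes a recent kubernetes plugin
                    sh 'mvn -B verify'  // example build step
                }
            }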
            

            jglick Jesse Glick added a comment -

            Olivier Boudet subcase #3 as above should be addressed in recent releases: if an agent pod is deleted then the corresponding build should abort in a few minutes. There is not currently any logic which would do the same after a PodPhase: Failed. That would be a new RFE.

            piratejohnny Jon B added a comment -

            Jesse Glick Just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior to be a major step in the right direction for Jenkins. Here's what I noticed:

            Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
            Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
            The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):
            pipeline {
                agent { label 'universal' }
                ...
            This particular declarative pipeline tries to run an "sh" step at the end, inside a post{} section, to clean up after itself; but since the node was lost, the next error that also appears in the Jenkins console log is:
            org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
            This error was the result of the following code:
            post {
                always {
                    sh """|#!/bin/bash
                          |set -x
                          |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
                          |""".stripMargin()
                    ...
            Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

            Now if there's any way to get this to actually retry the step it was on, such that the pipeline can actually tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node is deleted during a scaledown is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of the developers' steps trigger and the pipeline concludes with a high degree of durability.

            Keep up the great work you are all doing. This is great.

            jglick Jesse Glick added a comment -

            The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

            if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

            Well, that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library, like:

            while (true) {
              try {
                node('spotty') {
                  sh '…'
                }
                break
              } catch (x) {
                if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                    x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                  continue
                } else {
                  throw x
                }
              }
            }
            

              People

              • Assignee: Unassigned
              • Reporter: piratejohnny Jon B
              • Votes: 26
              • Watchers: 39

                Dates

                • Created:
                • Updated: