  1. Jenkins
  2. JENKINS-31050

Slave goes offline during the build

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None

      Description

      The slave goes offline during job execution and throws the error shown below.

      Slave went offline during the build
      01:20:15 ERROR: Connection was broken: java.io.EOFException
      01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      01:20:15 at java.lang.Thread.run(Thread.java:724)
      01:20:15

            Activity

            lukerichardson Luke Richardson added a comment -

            In our configuration on AWS I found that the connection to slaves was being terminated after about 1 minute during one particular pipeline stage. The stage was a long-running git checkout that succeeded only intermittently.

            The solution for me was to increase the ELB idle timeout property on the load balancer in between the slave and master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.

            During the 1 minute period in which the slave was executing the long-running git checkout, it must have been transferring no data at all, so the ELB treated the TCP connection as idle and dropped it.
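The reasoning above can be sketched as a small check. This is only an illustration of how an idle timeout interacts with gaps in traffic, not a model of the ELB or of Jenkins remoting; the function name `first_idle_drop` and the sample timestamps are hypothetical, and the sketch assumes the connection is dropped once the silent period reaches the timeout.

```python
def first_idle_drop(traffic_times_s, idle_timeout_s):
    """Return the time (in seconds) at which a proxy with the given idle
    timeout would drop the connection, or None if no gap is long enough.

    traffic_times_s: ascending timestamps at which any data was sent.
    """
    previous = traffic_times_s[0]
    for current in traffic_times_s[1:]:
        # A silent gap at least as long as the timeout means the proxy
        # gives up while waiting for the next byte.
        if current - previous >= idle_timeout_s:
            return previous + idle_timeout_s
        previous = current
    return None


# Hypothetical traffic: a 75-second silent stretch between t=20 and t=95.
sample = [0, 10, 20, 95, 100]
print(first_idle_drop(sample, 60))   # 80
print(first_idle_drop(sample, 300))  # None
```

With the 60-second default, the silent stretch is long enough for the connection to be dropped; with a larger timeout, the same traffic pattern survives.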

            oleg_nenashev Oleg Nenashev added a comment -

            Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.

            shrapm shraddha Magar added a comment - - edited

            I am also facing the same issue: the agent went offline during a build.

            I am using Jenkins v2.105 and JRE 1.8.

            I am using Linux as the master and IBM AIX and Windows Server 2K12 as slaves. We run nightly builds on the slaves, but sometimes the agent goes offline and the build does not complete. If anybody has a workaround for this issue, please let me know.

            Thanks in advance.

            pgodithi Prudhvi Godithi added a comment - - edited

            Hey, I am having the same issue with the Kubernetes plugin, where slaves connect to the master over JNLP on a particular port. We have even increased the ELB connection timeout, but slaves still go offline in the middle of builds; the job works fine when it is rebuilt. This is having a huge impact on our pipeline builds. Our issue is very close to what Raghu Pallikonda has mentioned above. If there is any solution for this, please let me know.
            Thank you.
            Slave version:

            remoting-3.20.jar

            Error:

            hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.

             

            Should I upgrade the remoting to latest version?

            th3mis The th3mis added a comment -

            Hello everyone, I faced the same problem: the slave goes offline during the build, with either an SSH or a JNLP agent.

            TLDR: The Jenkins agent and the build shell are in the same process hierarchy with the same PGID, so kill(pid = 0, signal = SIGTERM) kills the Jenkins agent too.

             

            PID   PGID  SID   TPGID COMMAND
            13691 13691 49864 13691 java -jar agent.jar
            13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
            13820 13691 49864 13691    \_ kill(0, SIGTERM)

             

            I propose some form of agent daemonization to prevent this bug (call setsid() in thread pool?)

            Description:

            In our case we build many projects using make, and we start and abort builds many times. GNU make can have pid = 0 in its internal structure. When we click abort build in Jenkins, it sends SIGTERM to the child processes; make then sends SIGTERM to its children, and sometimes GNU make (fixed after ) calls `kill(0, SIGTERM)`, which on a Linux agent terminates the whole process group, including the Jenkins agent. The result is an agent that dies during the build.
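The effect of a separate session can be sketched as follows. This is a minimal POSIX-only illustration, not how Jenkins launches build steps and not the exact change proposed above: instead of the agent calling setsid() on itself, the launcher here puts the child into a new session via Python's `start_new_session=True`. The function name `run_isolated_build_step` and the inline child script are hypothetical stand-ins for a build step that calls `kill(0, SIGTERM)`.

```python
import os
import signal
import subprocess
import sys

# Child script: report its own process group, then signal the whole group,
# mimicking the kill(0, SIGTERM) call described in the comment above.
CHILD = (
    "import os, signal\n"
    "print(os.getpgid(0), flush=True)\n"
    "os.kill(0, signal.SIGTERM)\n"
)


def run_isolated_build_step():
    """Run the child in its own session so that kill(0, ...) in the child
    cannot reach the launching process. Returns (child_pgid, returncode)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # setsid() is called in the child before exec
    )
    out, _ = proc.communicate()
    return int(out.strip()), proc.returncode


if __name__ == "__main__":
    parent_pgid = os.getpgid(0)
    child_pgid, returncode = run_isolated_build_step()
    # The launcher is still alive here because it is in a different group.
    print("process groups differ:", parent_pgid != child_pgid)
    print("child killed by SIGTERM:", returncode == -signal.SIGTERM)
```

Without `start_new_session=True` the child would inherit the launcher's process group, and the same `kill(0, SIGTERM)` would also be delivered to the launcher, which matches the PGID column in the process listing above.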


              People

              • Assignee:
                Unassigned
                Reporter:
                nutcracker66 Sujith Dinakar
              • Votes:
                23
                Watchers:
                30
