JENKINS-48865

JNLP Agents/Slaves Disconnecting Unpredictably

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Component/s: remoting
    • Labels:
    • Environment:
      Jenkins Master - 2.100, Ubuntu
      Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
      Windows Agent - Running inside a container on Windows Server 1709

      Description

      I've set up some permanent build agents that run as containers for my build server, which is currently hosted on Azure virtual machines.

      Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I'm experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a bit.

      I haven't been able to establish any common cause for the disconnects on either of them. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master's) to the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux-based machines; this hasn't helped either.
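
      For reference, the other thing commonly suggested for Azure-hosted connections that drop after sitting idle is enabling aggressive TCP keepalives at the OS level on the agent hosts, so the connection never looks idle to Azure's load balancer. This is only a sketch of that idea; I haven't confirmed it helps here, and the values are purely illustrative:

      # /etc/sysctl.d/99-tcp-keepalive.conf -- illustrative values, not validated here
      # start probing after 2 minutes idle, probe every 30 seconds, give up after 8 missed probes
      net.ipv4.tcp_keepalive_time = 120
      net.ipv4.tcp_keepalive_intvl = 30
      net.ipv4.tcp_keepalive_probes = 8

      # apply without a reboot
      sudo sysctl --system

      (Keepalives only take effect if the JVM actually sets SO_KEEPALIVE on the JNLP socket, which I haven't verified.)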

      I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear picture of which side is actually failing. One recent disconnect produced this on the Linux agent:

      Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
      java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
          at hudson.remoting.PingThread.ping(PingThread.java:134)
          at hudson.remoting.PingThread.run(PingThread.java:90)

      This seems to indicate a ping timeout, but the networking on the machine is fine. If I connect and restart the agent container, it connects right away and seems to be healthy for a while again.  Here's what the Jenkins master reports for the agent:

      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
          at hudson.remoting.Channel.close(Channel.java:1405)
          at hudson.remoting.Channel.close(Channel.java:1358)
          at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
          at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
          at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

      This message comes up quite often, but it generally just seems to indicate that the agent vanished and Jenkins doesn't know why, so I'm not sure how much help it is.
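
      For reference, the ping in the first log above comes from hudson.slaves.ChannelPinger / hudson.remoting.PingThread. As far as I can tell, the ping interval and timeout can be tuned via system properties when starting the master (the defaults appear to be a 5-minute interval and a 4-minute timeout); the flags below are just an illustration of that knob, not something I've confirmed changes the behaviour here:

      # illustrative master startup: ping agents every 2 minutes, give up after 3 minutes
      java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=120 \
           -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=180 \
           -jar jenkins.war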

      I've been researching this issue for a while and have tried quite a few of the suggestions from existing bugs here on this tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.

      I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!

        Attachments

          Issue Links

            Activity

            awong29 Alfred Wong added a comment -

            I have an issue very similar to this one. My observation is that the slave loses connectivity and tries to re-establish a connection, but the master rejects the new connection because it thinks it already has one, while at the same time the master is pinging the slave and waiting out the 4-minute ping timeout. I think this error condition could be handled a bit differently: if the ping is not being answered and a new connection request comes in, the master should accept the new connection instead of waiting four minutes before destroying the old one. I have attached a log file from the master. The only thing I am not sure about is why the slave needs to request a new connection in the first place; maybe the connection to the master is not very stable. It would be nice to have more slave logs to see why the connection is dropped.

            The Jenkins version is 2.150.3, running under Kubernetes, and the slaves are Windows slaves started using JNLP.

            jthompson Jeff Thompson added a comment -

            Alfred Wong, your description sounds different from the original report. The original report was about unpredictable disconnects. These can happen for many reasons, but often occur because of system, network, or environmental issues. Your description concerns re-connection problems. I think it would be better for you to create a separate ticket for your issue.

            Could you share more information about what is occurring? How you launch your agents and anything relevant about their configuration would help, and agent logs would be essential.

            awong29 Alfred Wong added a comment -

            Sure, I can create a new JIRA. I think the original problem I hit was the disconnect, and it is still happening a few times a day; our vendor OpenShift and our container team have spent the last few weeks investigating it. I will put the re-connection issue in another JIRA. Thanks.

            jthompson Jeff Thompson added a comment -

            Yes, disconnect issues can be very difficult to track down. They're usually due to something closing the connection at the TCP layer, or to one end being overloaded and unable to maintain its side.

            I think we should re-close this ticket.

            awong29 Alfred Wong added a comment -

            I will update this issue if our IT finds anything more about why the disconnections happen. Thanks.


              People

              • Assignee: Unassigned
              • Reporter: jomega Alexander Trauzzi
              • Votes: 2
              • Watchers: 7

                Dates

                • Created:
                • Updated:
                • Resolved: