Status: Closed
Environment: Jenkins Master - 2.100, Ubuntu
Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
Windows Agent - Running inside a container on Windows Server 1709
I've set up some permanent build agents that run as containers for my build server which I've got running on Azure virtual machines at the moment.
Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I am experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a bit.
I've not been able to establish any kind of common reason for the disconnects between the two. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master) to the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux based machines, but this hasn't helped either.
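For anyone who wants to compare settings, here is a minimal Python sketch that dumps the Linux TCP keepalive parameters from /proc. Note that this is an assumption on my part about which kernel settings might matter for idle connections; they are not the same thing as the connect timeout mentioned above, and they only affect sockets that have SO_KEEPALIVE enabled. The function names are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl names for TCP keepalive (assumed relevant here).
KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between unanswered probes
    "tcp_keepalive_probes",  # unanswered probes before giving up
)


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Return {param: int}, with None for any file that can't be read."""
    values = {}
    for name in KEEPALIVE_PARAMS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def seconds_until_dead(values):
    """Idle seconds before the kernel drops a silent peer, or None if unknown."""
    if any(values.get(name) is None for name in KEEPALIVE_PARAMS):
        return None
    return (
        values["tcp_keepalive_time"]
        + values["tcp_keepalive_intvl"] * values["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    current = read_keepalive()
    for name, value in current.items():
        print(f"{name} = {value}")
    print(f"seconds until a silent peer is dropped: {seconds_until_dead(current)}")
```

Running it on the master and on each Linux agent host (and inside the agent container) would show whether the values differ between machines.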
I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is necessarily failing. One recent disconnect produced this on the linux agent:
Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel JNLP4-connect connection to 184.108.40.206/220.127.116.11:49187.
java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
    at hudson.remoting.PingThread.ping(PingThread.java:134)
    at hudson.remoting.PingThread.run(PingThread.java:90)
This seems to indicate a ping timeout, but the networking on the machine is fine. If I connect and restart the agent container, it connects right away and seems to be healthy for a while again. Here's what the Jenkins master reports for the agent:
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
    at hudson.remoting.Channel.close(Channel.java:1405)
    at hudson.remoting.Channel.close(Channel.java:1358)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
    at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
This message seems to come up quite often, but as far as I can tell it only indicates that the agent vanished and Jenkins doesn't know why, so I don't know if it's any help.
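To put a number on the agent-side timeout: the two values in the TimeoutException from the Linux agent look like epoch milliseconds. Here is a minimal Python sketch that pulls them out of the message and converts them (the name ping_window is just something made up for illustration). Assuming the agent logs in UTC, the deadline lines up with the 2:33:40 PM log timestamp, and the gap between the two values is 240 seconds.

```python
import re
from datetime import datetime, timezone

# Exception message copied from the Linux agent log in this report.
LOG_MESSAGE = (
    "java.util.concurrent.TimeoutException: "
    "Ping started at 1515508180945 hasn't completed by 1515508420945"
)

PING_RE = re.compile(r"Ping started at (\d+) hasn't completed by (\d+)")


def ping_window(message):
    """Return (start, deadline, elapsed_seconds) parsed from a ping timeout message.

    Assumes both numbers are milliseconds since the Unix epoch.
    """
    match = PING_RE.search(message)
    if match is None:
        raise ValueError("no ping timestamps found in message")
    start_ms, deadline_ms = (int(group) for group in match.groups())
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    deadline = datetime.fromtimestamp(deadline_ms / 1000, tz=timezone.utc)
    return start, deadline, (deadline_ms - start_ms) / 1000


if __name__ == "__main__":
    start, deadline, elapsed = ping_window(LOG_MESSAGE)
    print("ping started (UTC):", start.strftime("%Y-%m-%d %H:%M:%S"))
    print("gave up at (UTC):  ", deadline.strftime("%Y-%m-%d %H:%M:%S"))
    print("elapsed seconds:   ", elapsed)
```

So the ping went unanswered for a full four minutes before the agent terminated the channel, which suggests the connection was already dead by the time the ping started rather than dropping at the moment of the log entry.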
I've been researching this issue for a while and have tried quite a few suggestions from existing bugs on this tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.
I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!