Almost daily, my Jenkins takes ALL agents offline. The reason is unknown to me, and this looks like a severe bug. Can you please help investigate?
From what I can observe, it starts with new agent connections failing with an SSL exception:
Sep 22, 2015 8:08:42 AM org.eclipse.jetty.util.log.JavaUtilLog warn
WARNING:
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:293)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:402)
at org.eclipse.jetty.io.nio.SslConnection.process(SslConnection.java:337)
at org.eclipse.jetty.io.nio.SslConnection.access$900(SslConnection.java:48)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.flush(SslConnection.java:738)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.shutdownOutput(SslConnection.java:641)
at org.eclipse.jetty.io.nio.SslConnection.onIdleExpired(SslConnection.java:260)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.onIdleExpired(SelectChannelEndPoint.java:349)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:326)
at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Sep 22, 2015 8:08:48 AM org.eclipse.jetty.util.log.JavaUtilLog warn
WARNING: handle failed
java.lang.IllegalStateException: Internal error
at sun.security.ssl.SSLEngineImpl.initHandshaker(Unknown Source)
at sun.security.ssl.SSLEngineImpl.readRecord(Unknown Source)
at sun.security.ssl.SSLEngineImpl.readNetRecord(Unknown Source)
at sun.security.ssl.SSLEngineImpl.unwrap(Unknown Source)
at javax.net.ssl.SSLEngine.unwrap(Unknown Source)
at org.eclipse.jetty.io.nio.SslConnection.unwrap(SslConnection.java:536)
at org.eclipse.jetty.io.nio.SslConnection.process(SslConnection.java:401)
at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:193)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:668)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Shortly afterwards, I can see Jenkins taking ALL agents offline:
Sep 22, 2015 8:20:54 AM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel SLAVE-101051.
java.util.concurrent.TimeoutException: Ping started at 1442902614156 hasn't completed by 1442902854206
at hudson.remoting.PingThread.ping(PingThread.java:126)
at hudson.remoting.PingThread.run(PingThread.java:85)
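The two epoch-millisecond values in the TimeoutException above can be checked with a minimal sketch like the following. It only does the arithmetic on the numbers from the log; the helper name `ping_window` is made up for illustration and is not part of Jenkins or its remoting library.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the TimeoutException message in the log above.
PING_STARTED_MS = 1442902614156
PING_DEADLINE_MS = 1442902854206

def ping_window(started_ms, deadline_ms):
    """Return (elapsed time, UTC start time) for a ping, given epoch milliseconds."""
    elapsed = timedelta(milliseconds=deadline_ms - started_ms)
    epoch = datetime.fromtimestamp(0, tz=timezone.utc)
    started_utc = epoch + timedelta(milliseconds=started_ms)
    return elapsed, started_utc

elapsed, started_utc = ping_window(PING_STARTED_MS, PING_DEADLINE_MS)
print(f"Ping outstanding for {elapsed.total_seconds():.2f} s; "
      f"started at {started_utc:%H:%M:%S} UTC")
```

This gives an elapsed time of 240.05 seconds, so the ping had been outstanding for roughly four minutes when the channel was terminated. The start time works out to 06:16:54 UTC, which would line up with the 8:20:54 AM log timestamp if the server clock is at UTC+2 (an assumption on my side).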
Afterwards, ALL agents try to register with Jenkins again, but Jenkins rejects them with:
INFO: Accepted connection #288 from /10.0.209.109:64213
Sep 22, 2015 8:47:00 AM jenkins.slaves.JnlpSlaveHandshake error
WARNING: TCP slave agent connection handler #288 with /10.0.209.109:64213 is aborted: SLAVE-719161 is already connected to this master. Rejecting this connection.
Sep 22, 2015 8:47:00 AM hudson.TcpSlaveAgentListener$ConnectionHandler run
If Jenkins kicks out all agents, I would expect it to accept them again automatically when they reconnect, instead of rejecting them by referring to an already existing connection. Beyond that, all agents being taken offline at once due to a ping failure looks like a bug in itself.
Please find the full logs attached as well.