For approximately 9 months we have had constant trouble with our master losing connectivity to our Mac OS X Mavericks slave (connected via SSH). After some time, the master can no longer communicate with the slave, so jobs fail with the following error message:
Building remotely on MAC_OS_mavericks_64bit (macos mavericks java7)
FATAL: channel is already closed
hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.remoting.Channel.send(Channel.java:541)
    at hudson.remoting.Request.call(Request.java:129)
    at hudson.remoting.Channel.call(Channel.java:739)
    at hudson.EnvVars.getRemote(EnvVars.java:404)
    at hudson.model.Computer.getEnvironment(Computer.java:912)
    at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
    at hudson.model.Run.getEnvironment(Run.java:2221)
    at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:885)
    at hudson.matrix.MatrixRun$MatrixRunExecution.decideWorkspace(MatrixRun.java:175)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:513)
    at hudson.model.Run.execute(Run.java:1706)
    at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:232)
Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1027)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
    at hudson.remoting.PingThread.ping(PingThread.java:120)
    at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
    ... 2 more
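As a side note on reading the trace above: the two numbers in the TimeoutException message look like epoch timestamps in milliseconds (an assumption on my part, the message itself does not state the unit). Under that assumption, a minimal Python sketch to work out how long the ping was outstanding before the channel was closed could look like this; `ping_window` is just an illustrative helper name, not part of Jenkins:

```python
import re
from datetime import datetime, timezone

# Message copied verbatim from the TimeoutException in the trace above.
MESSAGE = "Ping started on 1412069451954 hasn't completed at 1412069691955"


def ping_window(message):
    """Parse a ping timeout message.

    Returns (start, end, elapsed_ms). Assumes both numbers are
    epoch timestamps in milliseconds.
    """
    match = re.search(r"Ping started on (\d+) hasn't completed at (\d+)", message)
    if match is None:
        raise ValueError("not a ping timeout message")
    start_ms, end_ms = (int(group) for group in match.groups())
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
    return start, end, end_ms - start_ms


start, end, elapsed_ms = ping_window(MESSAGE)
print("ping started (UTC):", start.isoformat())
print("gave up at (UTC):  ", end.isoformat())
print("elapsed ms:        ", elapsed_ms)
```

If the millisecond assumption holds, the gap is 240001 ms, i.e. the ping went unanswered for about four minutes before the channel was declared dead.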
At some point, the slave is then marked as offline. When trying to reconnect, nothing happens: you see an empty log window with just the circling loading animation, and no output is ever generated.
We could not observe any issues with the underlying network connection. Every time I observe this error, SSH-ing to the slave as the jenkins user is possible without any problems.
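For anyone who wants to repeat this connectivity check in a scripted way, a rough sketch follows. It is not what we ran (we checked by hand); the host name is a placeholder and the helper names `build_ssh_probe` and `probe` are made up for illustration. It relies only on the standard OpenSSH options `BatchMode` and `ConnectTimeout`:

```python
import subprocess


def build_ssh_probe(host, user="jenkins", connect_timeout=10):
    """Build a non-interactive ssh command that runs `true` on the remote host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                       # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",   # fail fast if unreachable
        f"{user}@{host}",
        "true",
    ]


def probe(host, user="jenkins"):
    """Return True if an ssh login as `user` succeeds, False otherwise."""
    try:
        result = subprocess.run(
            build_ssh_probe(host, user),
            capture_output=True,
            timeout=30,  # hard upper bound in case ssh itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

Calling `probe("mavericks-slave.example.com")` (hypothetical host name) at the moment the slave shows as offline would confirm whether plain SSH still works while the Jenkins channel is dead.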
This only happens for the Mavericks slave. All other Linux and Windows slaves work perfectly.
What is extremely confusing is that once Jenkins has ended up in this condition, you cannot restart it cleanly. You first have to kill the Java process with SIGKILL, even though it is apparently not completely stuck, since everything apart from the Mavericks slave continues to work perfectly.
The general Jenkins log file only shows that the node monitoring jobs (checking disk space etc.) also suffer from the connectivity issue:
WARNING: Failed to monitor MAC_OS_mavericks_64bit for Architecture
hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.remoting.Channel.send(Channel.java:541)
    at hudson.remoting.Request.callAsync(Request.java:208)
    at hudson.remoting.Channel.callAsync(Channel.java:766)
    at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
    at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)
Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1027)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
    at hudson.remoting.PingThread.ping(PingThread.java:120)
    at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
    ... 2 more
Apart from this, no errors are visible for that slave.
A thread dump taken while the master tries to reconnect to the slave but nothing happens is available here:
http://pastebin.com/DxFU8j7C