Type: Bug
Resolution: Unresolved
Priority: Minor
Component/s: core
Labels: None
Environment:
Linux LTS master running on Debian wheezy

java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-5~deb7u1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Mac slave running mavericks

For approximately nine months we have had constant trouble with our master losing connectivity to our Mac Mavericks slave (connected via SSH). After some time, the master can no longer communicate with the slave, and jobs fail with the following error message:

Building remotely on MAC_OS_mavericks_64bit (macos mavericks java7)
FATAL: channel is already closed
hudson.remoting.ChannelClosedException: channel is already closed
	at hudson.remoting.Channel.send(Channel.java:541)
	at hudson.remoting.Request.call(Request.java:129)
	at hudson.remoting.Channel.call(Channel.java:739)
	at hudson.EnvVars.getRemote(EnvVars.java:404)
	at hudson.model.Computer.getEnvironment(Computer.java:912)
	at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
	at hudson.model.Run.getEnvironment(Run.java:2221)
	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:885)
	at hudson.matrix.MatrixRun$MatrixRunExecution.decideWorkspace(MatrixRun.java:175)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:513)
	at hudson.model.Run.execute(Run.java:1706)
	at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:232)
Caused by: java.io.IOException
	at hudson.remoting.Channel.close(Channel.java:1027)
	at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
	at hudson.remoting.PingThread.ping(PingThread.java:120)
	at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
	... 2 more
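
The two numbers in the TimeoutException message look like epoch timestamps in milliseconds. A minimal sketch, assuming that interpretation, for working out how long the ping was outstanding before the channel was closed (the helper names `ping_elapsed_ms` and `to_utc` are made up for illustration and are not part of Jenkins):

```python
import re
from datetime import datetime, timezone

# Message copied verbatim from the stack trace above.
MESSAGE = "Ping started on 1412069451954 hasn't completed at 1412069691955"


def ping_elapsed_ms(message):
    """Return (start_ms, end_ms, elapsed_ms) parsed from a ping timeout message."""
    match = re.search(r"Ping started on (\d+) hasn't completed at (\d+)", message)
    if match is None:
        raise ValueError("not a ping timeout message")
    start_ms = int(match.group(1))
    end_ms = int(match.group(2))
    return start_ms, end_ms, end_ms - start_ms


def to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime (assumes epoch ms)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    start_ms, end_ms, elapsed = ping_elapsed_ms(MESSAGE)
    print("ping started (UTC):", to_utc(start_ms).isoformat())
    print("elapsed seconds:", elapsed / 1000)
```

For the message above the difference is 240001 ms, so the ping had been pending for about four minutes when the channel was declared dead. This suggests the channel was closed by a ping timeout and not by an explicit network error.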

At some point, the slave is then marked as offline. Trying to reconnect it has no effect: the log window stays empty and shows only the spinning loading animation. No output is ever generated.

We could not observe any issues with the underlying network connection. Every time I see this error, ssh-ing to the slave as the jenkins user works without any problems.

This only happens with the Mavericks slave. All other Linux and Windows slaves work perfectly.

What is extremely confusing is that once Jenkins is in this state, it cannot be restarted cleanly. You first have to kill the java process with SIGKILL, even though it is apparently not completely stuck: everything apart from the Mavericks slave continues to work perfectly.

The general Jenkins log file only shows that the node monitoring jobs (disk space checks etc.) suffer from the same connectivity issue:

WARNING: Failed to monitor MAC_OS_mavericks_64bit for Architecture
hudson.remoting.ChannelClosedException: channel is already closed
        at hudson.remoting.Channel.send(Channel.java:541)
        at hudson.remoting.Request.callAsync(Request.java:208)
        at hudson.remoting.Channel.callAsync(Channel.java:766)
        at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)
Caused by: java.io.IOException
        at hudson.remoting.Channel.close(Channel.java:1027)
        at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
        at hudson.remoting.PingThread.ping(PingThread.java:120)
        at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
        ... 2 more
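
To see which nodes and monitors are affected, the "Failed to monitor" warnings in the Jenkins log can be tallied. This is only a rough sketch: the regular expression is derived from the single warning line shown above, the function name `count_monitor_failures` is hypothetical, and the sample text simply repeats that warning to stand in for a real log file.

```python
import re
from collections import Counter

# Pattern derived from the one warning line in this report; other Jenkins
# versions may format the message differently (assumption).
WARNING_RE = re.compile(r"WARNING: Failed to monitor (\S+) for (.+)")


def count_monitor_failures(log_text):
    """Count 'Failed to monitor <node> for <monitor>' warnings per (node, monitor)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            node = match.group(1)
            monitor = match.group(2).strip()
            counts[(node, monitor)] += 1
    return counts


# Illustrative input: the warning from this report, repeated twice.
SAMPLE_LOG = """\
WARNING: Failed to monitor MAC_OS_mavericks_64bit for Architecture
hudson.remoting.ChannelClosedException: channel is already closed
WARNING: Failed to monitor MAC_OS_mavericks_64bit for Architecture
"""

if __name__ == "__main__":
    for (node, monitor), n in count_monitor_failures(SAMPLE_LOG).items():
        print(f"{node}: {monitor} failed {n} time(s)")
```

If only the Mavericks node appears in the tally of a real log, that would match the observation that all other slaves are unaffected.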

Apart from this, no errors are visible for that slave.

A thread dump taken while the master tries to reconnect to the slave, but nothing happens, is available here:
http://pastebin.com/DxFU8j7C
