Loading...

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: remoting
Labels:
- remoting
Environment:
Master: Jenkins 1.572 on OsX 10.7.5, Java: 1.6.0_65-b14-462-11M4609
Slave: Slave version 2.43 (as reported by the versioncolumn plugin) on Windows 2008 R2 Standard (64bit), Java 7 update 55 (32 bit)

Similar Issues:

Show

I have a similar issue to ~~JENKINS-14332~~, although that says there is no problem on a windows slave (because it eventually reconnects?), so I'm logging this seperatly.

The message "hudson.remoting.ChannelClosedException: channel is already closed" breaks my builds on a windows slave. It doesn't happen on every build, but a nightly run of 7 builds seems to have at least one each time.

Because these seem to be connection issues, so might be imeout related, here's some setup info, even though I doubt it will be relevant: all those builds are matrix builds. My setup consist of an OSX master YYY that does the build for the 'mac' axis, and a slave XXX that does the build of the 'windows' axis. The windows machine is significantly slower than the mac, so some the builds spend quite some time in a state of '1 axis done, other axis queued'. I don't really see issues here, it seems to work fine for most of the builds most of the time: e.g. these 7 builds are all queued at 20:00, and the example below fails shortly after 21:00, but other builds finish fine later on (so after a longer wait). The failing build on XXX started at 20:50, so it fails after 17 minutes. Other, later and earlier builds run for an hour without problems.

In the middle of a build, the build log (as seen on the master) suddenly says:

==========================================
21:07:10 FATAL: channel is already closed
21:07:10 hudson.remoting.ChannelClosedException: channel is already closed
21:07:10 at hudson.remoting.Channel.send(Channel.java:541)
21:07:10 at hudson.remoting.Request.call(Request.java:129)
21:07:10 at hudson.remoting.Channel.call(Channel.java:739)
21:07:10 at hudson.EnvVars.getRemote(EnvVars.java:404)
21:07:10 at hudson.model.Computer.getEnvironment(Computer.java:912)
21:07:10 at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
21:07:10 at hudson.model.Run.getEnvironment(Run.java:2250)
21:07:10 at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:907)
21:07:10 at org.jenkinsci.plugins.tokenmacro.impl.EnvironmentVariableMacro.evaluate(EnvironmentVariableMacro.java:23)
21:07:10 at org.jenkinsci.plugins.tokenmacro.DataBoundTokenMacro.evaluate(DataBoundTokenMacro.java:189)
21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:182)
21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:154)
21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.setDisplayName(BuildNameSetter.java:50)
21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.access$000(BuildNameSetter.java:26)
21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter$1.tearDown(BuildNameSetter.java:42)
21:07:10 at hudson.model.Build$BuildExecution.doRun(Build.java:171)
21:07:10 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:535)
21:07:10 at hudson.model.Run.execute(Run.java:1732)
21:07:10 at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
21:07:10 at hudson.model.ResourceController.execute(ResourceController.java:88)
21:07:10 at hudson.model.Executor.run(Executor.java:234)
21:07:10 Caused by: java.io.IOException: Failed to abort
21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
21:07:10 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
21:07:10 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
21:07:10 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
21:07:10 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
21:07:10 at java.lang.Thread.run(Thread.java:695)
21:07:10 Caused by: java.io.IOException: Operation timed out
21:07:10 at sun.nio.ch.FileDispatcher.read0(Native Method)
21:07:10 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
21:07:10 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
21:07:10 at sun.nio.ch.IOUtil.read(IOUtil.java:171)
21:07:10 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
21:07:10 ... 7 more
==========================================

This combines with this in the slave error log:

==========================================
Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.net.SocketException: Software caused connection abort: recv failed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

Jul 16, 2014 9:01:57 PM hudson.util.ProcessTree getKillers
WARNING: Failed to obtain killers
hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:541)
at hudson.remoting.Request.call(Request.java:129)
at hudson.remoting.Channel.call(Channel.java:739)
at hudson.util.ProcessTree.getKillers(ProcessTree.java:162)
at hudson.util.ProcessTree$OSProcess.killByKiller(ProcessTree.java:221)
at hudson.util.ProcessTree$Windows$1.killRecursively(ProcessTree.java:413)
at hudson.util.ProcessTree.killAll(ProcessTree.java:149)
at hudson.Proc$LocalProc.destroy(Proc.java:379)
at hudson.Proc$LocalProc.join(Proc.java:352)
at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1116)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:309)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:290)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:249)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:328)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:63)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Software caused connection abort: recv failed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

Jul 16, 2014 9:01:57 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Jul 16, 2014 9:01:57 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onDisconnect
INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@18f3cc5
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: XXX
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among http://YYY:8080/
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to YYY:62768
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
at hudson.remoting.Engine.run(Engine.java:276)

[repeated reconnection attempts, +- 2 times/sec]

Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: XXX
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among http://YYY:8080/
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to YYY:62768
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
at hudson.remoting.Engine.run(Engine.java:276)

Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: XXX
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among http://YYY:8080/
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to YYY:62768
Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jul 16, 2014 9:07:07 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
==========================================

and the wrapper log says:

==========================================
2014-07-16 21:01:59 - Stopping jenkinsslave-e__JenkinsHome
2014-07-16 21:01:59 - ProcessKill 3664
2014-07-16 21:01:59 - Send SIGINT 3664
2014-07-16 21:01:59 - SIGINT to3664 successful
2014-07-16 21:01:59 - Finished jenkinsslave-e__JenkinsHome
2014-07-16 21:02:01 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
2014-07-16 21:02:01 - Started 3264
2014-07-16 21:02:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
2014-07-16 21:02:06 - Started 1364

[many more of these reconnection attempts]

2014-07-16 21:07:04 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
2014-07-16 21:07:04 - Started 3448
2014-07-16 21:07:05 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
2014-07-16 21:07:05 - Started 1384
2014-07-16 21:07:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
2014-07-16 21:07:06 - Started 3140
[nothing after that, so this is the reconnection attempt that worked, I think.]
==========================================

Finally in the server log:

==========================================
TCP slave agent connection handler #1153 with /IPofXXX:52399 is aborted: Unrecognized name: XXX

Jul 16, 2014 9:07:10 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run

Accepted connection #1154 from /IPofXXX:52406

Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: XXX is already connected to this master. Rejecting this connection.

Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: Unrecognized name: XXX

Jul 16, 2014 9:07:10 PM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

Communication problem
java.io.IOException: Operation timed out
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)

Jul 16, 2014 9:07:11 PM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

NioChannelHub keys=1 gen=1952690: Computer.threadPoolForRemoting 1 for + XXX terminated
java.io.IOException: Failed to abort
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.io.IOException: Operation timed out
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
... 7 more
==========================================

Note that the clocks on both machines are within a few seconds of each other, so it seems that the order of events is:

slave detects a failed recv, which closes the channel
next communication attempt detects the closed channel
slave tries to reconnect, +- 2 times/sec
server refuses this because it thinks it is still connected
after +-5 minutes, the server pings the slave (some sort of keepalive?), which finds out the connection is already dead
at that point, the server accepts the next connection attempt from the slave

If the cause of the first failed recv can't be found or prevented, it seems the communication between client and server needs some sort of 'reconnection' mechanism (so the client can reconnect right away, instead of having to abort the build and wait until the server times out the connection). Then, the client could reconnect and retry what it was doing, instead of aborting the build on a failed connection?

relates to

JENKINS-14332 Repeated channel/timeout errors from Jenkins slave

Closed

Details

Description

Attachments

Issue Links

Activity

People

Dates