Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-23852

slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
    • Environment:
      Master: Jenkins 1.572 on OsX 10.7.5, Java: 1.6.0_65-b14-462-11M4609
      Slave: Slave version 2.43 (as reported by the versioncolumn plugin) on Windows 2008 R2 Standard (64bit), Java 7 update 55 (32 bit)
    • Similar Issues:

      Description

      I have a similar issue to JENKINS-14332, although that says there is no problem on a windows slave (because it eventually reconnects?), so I'm logging this seperatly.

      The message "hudson.remoting.ChannelClosedException: channel is already closed" breaks my builds on a windows slave. It doesn't happen on every build, but a nightly run of 7 builds seems to have at least one each time.

      Because these seem to be connection issues, so might be imeout related, here's some setup info, even though I doubt it will be relevant: all those builds are matrix builds. My setup consist of an OSX master YYY that does the build for the 'mac' axis, and a slave XXX that does the build of the 'windows' axis. The windows machine is significantly slower than the mac, so some the builds spend quite some time in a state of '1 axis done, other axis queued'. I don't really see issues here, it seems to work fine for most of the builds most of the time: e.g. these 7 builds are all queued at 20:00, and the example below fails shortly after 21:00, but other builds finish fine later on (so after a longer wait). The failing build on XXX started at 20:50, so it fails after 17 minutes. Other, later and earlier builds run for an hour without problems.

      In the middle of a build, the build log (as seen on the master) suddenly says:

      ==========================================
      21:07:10 FATAL: channel is already closed
      21:07:10 hudson.remoting.ChannelClosedException: channel is already closed
      21:07:10 at hudson.remoting.Channel.send(Channel.java:541)
      21:07:10 at hudson.remoting.Request.call(Request.java:129)
      21:07:10 at hudson.remoting.Channel.call(Channel.java:739)
      21:07:10 at hudson.EnvVars.getRemote(EnvVars.java:404)
      21:07:10 at hudson.model.Computer.getEnvironment(Computer.java:912)
      21:07:10 at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
      21:07:10 at hudson.model.Run.getEnvironment(Run.java:2250)
      21:07:10 at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:907)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.impl.EnvironmentVariableMacro.evaluate(EnvironmentVariableMacro.java:23)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.DataBoundTokenMacro.evaluate(DataBoundTokenMacro.java:189)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:182)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:154)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.setDisplayName(BuildNameSetter.java:50)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.access$000(BuildNameSetter.java:26)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter$1.tearDown(BuildNameSetter.java:42)
      21:07:10 at hudson.model.Build$BuildExecution.doRun(Build.java:171)
      21:07:10 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:535)
      21:07:10 at hudson.model.Run.execute(Run.java:1732)
      21:07:10 at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
      21:07:10 at hudson.model.ResourceController.execute(ResourceController.java:88)
      21:07:10 at hudson.model.Executor.run(Executor.java:234)
      21:07:10 Caused by: java.io.IOException: Failed to abort
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      21:07:10 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      21:07:10 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      21:07:10 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      21:07:10 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      21:07:10 at java.lang.Thread.run(Thread.java:695)
      21:07:10 Caused by: java.io.IOException: Operation timed out
      21:07:10 at sun.nio.ch.FileDispatcher.read0(Native Method)
      21:07:10 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      21:07:10 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      21:07:10 at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      21:07:10 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      21:07:10 ... 7 more
      ==========================================

      This combines with this in the slave error log:

      ==========================================
      Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel channel
      java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.util.ProcessTree getKillers
      WARNING: Failed to obtain killers
      hudson.remoting.ChannelClosedException: channel is already closed
      at hudson.remoting.Channel.send(Channel.java:541)
      at hudson.remoting.Request.call(Request.java:129)
      at hudson.remoting.Channel.call(Channel.java:739)
      at hudson.util.ProcessTree.getKillers(ProcessTree.java:162)
      at hudson.util.ProcessTree$OSProcess.killByKiller(ProcessTree.java:221)
      at hudson.util.ProcessTree$Windows$1.killRecursively(ProcessTree.java:413)
      at hudson.util.ProcessTree.killAll(ProcessTree.java:149)
      at hudson.Proc$LocalProc.destroy(Proc.java:379)
      at hudson.Proc$LocalProc.join(Proc.java:352)
      at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1116)
      at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:309)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:290)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:249)
      at hudson.remoting.UserRequest.perform(UserRequest.java:118)
      at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      at hudson.remoting.Request$2.run(Request.java:328)
      at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      at java.util.concurrent.FutureTask.run(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at hudson.remoting.Engine$1$1.run(Engine.java:63)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Jul 16, 2014 9:01:57 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onDisconnect
      INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@18f3cc5
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      [repeated reconnection attempts, +- 2 times/sec]

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:07 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      ==========================================

      and the wrapper log says:

      ==========================================
      2014-07-16 21:01:59 - Stopping jenkinsslave-e__JenkinsHome
      2014-07-16 21:01:59 - ProcessKill 3664
      2014-07-16 21:01:59 - Send SIGINT 3664
      2014-07-16 21:01:59 - SIGINT to3664 successful
      2014-07-16 21:01:59 - Finished jenkinsslave-e__JenkinsHome
      2014-07-16 21:02:01 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:01 - Started 3264
      2014-07-16 21:02:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:06 - Started 1364

      [many more of these reconnection attempts]

      2014-07-16 21:07:04 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:04 - Started 3448
      2014-07-16 21:07:05 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:05 - Started 1384
      2014-07-16 21:07:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:06 - Started 3140
      [nothing after that, so this is the reconnection attempt that worked, I think.]
      ==========================================

      Finally in the server log:

      ==========================================
      TCP slave agent connection handler #1153 with /IPofXXX:52399 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run

      Accepted connection #1154 from /IPofXXX:52406

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: XXX is already connected to this master. Rejecting this connection.

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

      Communication problem
      java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)

      Jul 16, 2014 9:07:11 PM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

      NioChannelHub keys=1 gen=1952690: Computer.threadPoolForRemoting 1 for + XXX terminated
      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      ... 7 more
      ==========================================

      Note that the clocks on both machines are within a few seconds of each other, so it seems that the order of events is:

      • slave detects a failed recv, which closes the channel
      • next communication attempt detects the closed channel
      • slave tries to reconnect, +- 2 times/sec
      • server refuses this because it thinks it is still connected
      • after +-5 minutes, the server pings the slave (some sort of keepalive?), which finds out the connection is already dead
      • at that point, the server accepts the next connection attempt from the slave

      If the cause of the first failed recv can't be found or prevented, it seems the communication between client and server needs some sort of 'reconnection' mechanism (so the client can reconnect right away, instead of having to abort the build and wait until the server times out the connection). Then, the client could reconnect and retry what it was doing, instead of aborting the build on a failed connection?

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                legolas Arnt Witteveen
              • Votes:
                41 Vote for this issue
                Watchers:
                52 Start watching this issue

                Dates

                • Created:
                  Updated: