Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-23852

slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
    • Environment:
      Master: Jenkins 1.572 on OsX 10.7.5, Java: 1.6.0_65-b14-462-11M4609
      Slave: Slave version 2.43 (as reported by the versioncolumn plugin) on Windows 2008 R2 Standard (64bit), Java 7 update 55 (32 bit)
    • Similar Issues:

      Description

      I have a similar issue to JENKINS-14332, although that says there is no problem on a windows slave (because it eventually reconnects?), so I'm logging this seperatly.

      The message "hudson.remoting.ChannelClosedException: channel is already closed" breaks my builds on a windows slave. It doesn't happen on every build, but a nightly run of 7 builds seems to have at least one each time.

      Because these seem to be connection issues, so might be imeout related, here's some setup info, even though I doubt it will be relevant: all those builds are matrix builds. My setup consist of an OSX master YYY that does the build for the 'mac' axis, and a slave XXX that does the build of the 'windows' axis. The windows machine is significantly slower than the mac, so some the builds spend quite some time in a state of '1 axis done, other axis queued'. I don't really see issues here, it seems to work fine for most of the builds most of the time: e.g. these 7 builds are all queued at 20:00, and the example below fails shortly after 21:00, but other builds finish fine later on (so after a longer wait). The failing build on XXX started at 20:50, so it fails after 17 minutes. Other, later and earlier builds run for an hour without problems.

      In the middle of a build, the build log (as seen on the master) suddenly says:

      ==========================================
      21:07:10 FATAL: channel is already closed
      21:07:10 hudson.remoting.ChannelClosedException: channel is already closed
      21:07:10 at hudson.remoting.Channel.send(Channel.java:541)
      21:07:10 at hudson.remoting.Request.call(Request.java:129)
      21:07:10 at hudson.remoting.Channel.call(Channel.java:739)
      21:07:10 at hudson.EnvVars.getRemote(EnvVars.java:404)
      21:07:10 at hudson.model.Computer.getEnvironment(Computer.java:912)
      21:07:10 at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
      21:07:10 at hudson.model.Run.getEnvironment(Run.java:2250)
      21:07:10 at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:907)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.impl.EnvironmentVariableMacro.evaluate(EnvironmentVariableMacro.java:23)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.DataBoundTokenMacro.evaluate(DataBoundTokenMacro.java:189)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:182)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:154)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.setDisplayName(BuildNameSetter.java:50)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.access$000(BuildNameSetter.java:26)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter$1.tearDown(BuildNameSetter.java:42)
      21:07:10 at hudson.model.Build$BuildExecution.doRun(Build.java:171)
      21:07:10 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:535)
      21:07:10 at hudson.model.Run.execute(Run.java:1732)
      21:07:10 at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
      21:07:10 at hudson.model.ResourceController.execute(ResourceController.java:88)
      21:07:10 at hudson.model.Executor.run(Executor.java:234)
      21:07:10 Caused by: java.io.IOException: Failed to abort
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      21:07:10 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      21:07:10 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      21:07:10 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      21:07:10 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      21:07:10 at java.lang.Thread.run(Thread.java:695)
      21:07:10 Caused by: java.io.IOException: Operation timed out
      21:07:10 at sun.nio.ch.FileDispatcher.read0(Native Method)
      21:07:10 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      21:07:10 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      21:07:10 at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      21:07:10 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      21:07:10 ... 7 more
      ==========================================

      This combines with this in the slave error log:

      ==========================================
      Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel channel
      java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.util.ProcessTree getKillers
      WARNING: Failed to obtain killers
      hudson.remoting.ChannelClosedException: channel is already closed
      at hudson.remoting.Channel.send(Channel.java:541)
      at hudson.remoting.Request.call(Request.java:129)
      at hudson.remoting.Channel.call(Channel.java:739)
      at hudson.util.ProcessTree.getKillers(ProcessTree.java:162)
      at hudson.util.ProcessTree$OSProcess.killByKiller(ProcessTree.java:221)
      at hudson.util.ProcessTree$Windows$1.killRecursively(ProcessTree.java:413)
      at hudson.util.ProcessTree.killAll(ProcessTree.java:149)
      at hudson.Proc$LocalProc.destroy(Proc.java:379)
      at hudson.Proc$LocalProc.join(Proc.java:352)
      at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1116)
      at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:309)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:290)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:249)
      at hudson.remoting.UserRequest.perform(UserRequest.java:118)
      at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      at hudson.remoting.Request$2.run(Request.java:328)
      at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      at java.util.concurrent.FutureTask.run(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at hudson.remoting.Engine$1$1.run(Engine.java:63)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Jul 16, 2014 9:01:57 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onDisconnect
      INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@18f3cc5
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      [repeated reconnection attempts, +- 2 times/sec]

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:07 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      ==========================================

      and the wrapper log says:

      ==========================================
      2014-07-16 21:01:59 - Stopping jenkinsslave-e__JenkinsHome
      2014-07-16 21:01:59 - ProcessKill 3664
      2014-07-16 21:01:59 - Send SIGINT 3664
      2014-07-16 21:01:59 - SIGINT to3664 successful
      2014-07-16 21:01:59 - Finished jenkinsslave-e__JenkinsHome
      2014-07-16 21:02:01 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:01 - Started 3264
      2014-07-16 21:02:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:06 - Started 1364

      [many more of these reconnection attempts]

      2014-07-16 21:07:04 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:04 - Started 3448
      2014-07-16 21:07:05 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:05 - Started 1384
      2014-07-16 21:07:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:06 - Started 3140
      [nothing after that, so this is the reconnection attempt that worked, I think.]
      ==========================================

      Finally in the server log:

      ==========================================
      TCP slave agent connection handler #1153 with /IPofXXX:52399 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run

      Accepted connection #1154 from /IPofXXX:52406

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: XXX is already connected to this master. Rejecting this connection.

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

      Communication problem
      java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)

      Jul 16, 2014 9:07:11 PM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

      NioChannelHub keys=1 gen=1952690: Computer.threadPoolForRemoting 1 for + XXX terminated
      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      ... 7 more
      ==========================================

      Note that the clocks on both machines are within a few seconds of each other, so it seems that the order of events is:

      • slave detects a failed recv, which closes the channel
      • next communication attempt detects the closed channel
      • slave tries to reconnect, +- 2 times/sec
      • server refuses this because it thinks it is still connected
      • after +-5 minutes, the server pings the slave (some sort of keepalive?), which finds out the connection is already dead
      • at that point, the server accepts the next connection attempt from the slave

      If the cause of the first failed recv can't be found or prevented, it seems the communication between client and server needs some sort of 'reconnection' mechanism (so the client can reconnect right away, instead of having to abort the build and wait until the server times out the connection). Then, the client could reconnect and retry what it was doing, instead of aborting the build on a failed connection?

        Attachments

          Issue Links

            Activity

            Hide
            legolas Arnt Witteveen added a comment -

            JENKINS-19513 may also be related.

            Show
            legolas Arnt Witteveen added a comment - JENKINS-19513 may also be related.
            Hide
            kohsuke Kohsuke Kawaguchi added a comment -

            Thanks for the detailed log analysis. I agree with your order of events list, although there are some details that are still suspicious. For example, if the client has decided to shutdown the connection and that properly arrived at the server, we should see something other than "java.io.IOException: Operation timed out" in the read. And we don't set any read timeout, so I'm puzzled as to why we see this exception during the read.

            In any case, according to that theory, the first event that caused this is the following:

            Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
            SEVERE: I/O error in channel channel
            java.net.SocketException: Software caused connection abort: recv failed
            at java.net.SocketInputStream.socketRead0(Native Method)
            

            This appears to be an error unique to Windows, and I found this stackoverflow conversation helpful to understand what it means, as well as this KB article.

            If I understand this error correctly, it means that Jenkins slave has given WinSock some data to send, and WinSock has sent some packets over the network, but it's not getting the data through to the other side, and so it has given up and declared the connection lost. That happens much later after the write is called, so the next read that came in gets hit by this problem.

            The links I mentioned above do not really discuss what that "data transmission time-out" is about. It could be anything from WinSock not getting TCK ACK from the other side, TCP flow control, or self-imposed timeout unique to WinSock.

            I'd have normally suspected the network issue, but the succssive reconnection attempts seem to exclude this possibility.

            Another possibility could be that this is related to issues like JENKINS-24050, where an unrelated problem kills the NIO selector thread. When that happens, all the sockets stopped getting serviced, so data that the client sent will not be picked up by the application on the server side. TCP ACK should still come in this case, since kernel is getting the packets all right, but perhaps in this situation WinSock might decide to time out, perhaps due to a self-imposed timeout. This would need some experiments.

            I've just fixed JENKINS-24050, so if possible I'd love to have you try this fix, which should make it into 1.580. Another useful test is to disable the use of NIO on the server side for managing JNLP slaves with the system property '-Djenkins.slaves.NioChannelSelector.disabled=true' on the master. If the problem goes away with it, that'd be an useful input.

            For anyone else seeing problems, please check relevant logs on both the server and the slave and share it with us, so that we can resolve this problem efficiently.

            Show
            kohsuke Kohsuke Kawaguchi added a comment - Thanks for the detailed log analysis. I agree with your order of events list, although there are some details that are still suspicious. For example, if the client has decided to shutdown the connection and that properly arrived at the server, we should see something other than "java.io.IOException: Operation timed out" in the read. And we don't set any read timeout, so I'm puzzled as to why we see this exception during the read. In any case, according to that theory, the first event that caused this is the following: Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel channel java.net.SocketException: Software caused connection abort: recv failed at java.net.SocketInputStream.socketRead0(Native Method) This appears to be an error unique to Windows, and I found this stackoverflow conversation helpful to understand what it means, as well as this KB article . If I understand this error correctly, it means that Jenkins slave has given WinSock some data to send, and WinSock has sent some packets over the network, but it's not getting the data through to the other side, and so it has given up and declared the connection lost. That happens much later after the write is called, so the next read that came in gets hit by this problem. The links I mentioned above do not really discuss what that "data transmission time-out" is about. It could be anything from WinSock not getting TCK ACK from the other side, TCP flow control, or self-imposed timeout unique to WinSock. I'd have normally suspected the network issue, but the succssive reconnection attempts seem to exclude this possibility. Another possibility could be that this is related to issues like JENKINS-24050 , where an unrelated problem kills the NIO selector thread. When that happens, all the sockets stopped getting serviced, so data that the client sent will not be picked up by the application on the server side. TCP ACK should still come in this case, since kernel is getting the packets all right, but perhaps in this situation WinSock might decide to time out, perhaps due to a self-imposed timeout. This would need some experiments. I've just fixed JENKINS-24050 , so if possible I'd love to have you try this fix, which should make it into 1.580. Another useful test is to disable the use of NIO on the server side for managing JNLP slaves with the system property '-Djenkins.slaves.NioChannelSelector.disabled=true' on the master. If the problem goes away with it, that'd be an useful input. For anyone else seeing problems, please check relevant logs on both the server and the slave and share it with us, so that we can resolve this problem efficiently.
            Hide
            legolas Arnt Witteveen added a comment -

            You are correct that this was a network issue: after some time I noticed this always happened around 9PM and mentioned this to a network admin. He said that something similar had happened once, and we let a ping run for a whole day. It seems that there were some network interruptions (of up to a minute IIRC) at the specific time this happened. The machine got switched to another port on the router (or another router), and the issue has not reoccurred since.

            Sorry for not mentioning this earlier, I forgot about having logged this.

            Show
            legolas Arnt Witteveen added a comment - You are correct that this was a network issue: after some time I noticed this always happened around 9PM and mentioned this to a network admin. He said that something similar had happened once, and we let a ping run for a whole day. It seems that there were some network interruptions (of up to a minute IIRC) at the specific time this happened. The machine got switched to another port on the router (or another router), and the issue has not reoccurred since. Sorry for not mentioning this earlier, I forgot about having logged this.
            Hide
            oldelvet Richard Mortimer added a comment -

            I have been seeing similar problems to this. In my situation one of my slaves is a Windows 7 desktop computer that is set to suspend after a period of inactivity. I've had this running for a few years without problem and the slave has always reconnected successfully when the desktop is woken up.

            However since the NIO selector thread has been introduced the slaves have ended up being offline for prolonged periods of time and require a manual restart of the Jenkins slave service to bring them back online (see JENKINS-24272). One other symptom that I have observed during testing of this is the server rejecting connections similar to that reported here namely:

            Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
            SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
            java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
            at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
            at hudson.remoting.Engine.run(Engine.java:276)
            

            Having investigated I'm seeing that the slave has decided that the connection is broken and has tore down its TCP connection to the master but the master never sees the TCP RST for this so it continues to retry the TCP packets for 10 or so minutes until the TCP connection times out and then the master tears down the connection.

            The really strange this is that this behaviour seems to be different between NIO and threaded methods on the master. With threaded the TCP RST is sent by W7 and kills the connection before the slave reconnects but in NIO no RST is ever sent. That happens down at the network packet level so it shouldn't really be affected by NIO but I wonder if it a slight different configuration in the network stack (non-blocking vs blocking) that causes the different behaviour.

            I also suspect that this may be related to the change to re-start JNLP slaves using a new JVM rather than just restarting in the same process. It may be that the Windows TCP stack just immediately forgets the old TCP socket when the old JVM process exits and hence ignores retries rather than RSTing the connection.

            I'm planning on waiting for the current crop of JNLP slave fixes to get integrated and will then experiment a bit further to see if I can work out what is stopping the NIO selector getting notified that the slave has closed the TCP connection.

            Show
            oldelvet Richard Mortimer added a comment - I have been seeing similar problems to this. In my situation one of my slaves is a Windows 7 desktop computer that is set to suspend after a period of inactivity. I've had this running for a few years without problem and the slave has always reconnected successfully when the desktop is woken up. However since the NIO selector thread has been introduced the slaves have ended up being offline for prolonged periods of time and require a manual restart of the Jenkins slave service to bring them back online (see JENKINS-24272 ). One other symptom that I have observed during testing of this is the server rejecting connections similar to that reported here namely: Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection. java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection. at hudson.remoting.Engine.onConnectionRejected(Engine.java:304) at hudson.remoting.Engine.run(Engine.java:276) Having investigated I'm seeing that the slave has decided that the connection is broken and has tore down its TCP connection to the master but the master never sees the TCP RST for this so it continues to retry the TCP packets for 10 or so minutes until the TCP connection times out and then the master tears down the connection. The really strange this is that this behaviour seems to be different between NIO and threaded methods on the master. With threaded the TCP RST is sent by W7 and kills the connection before the slave reconnects but in NIO no RST is ever sent. That happens down at the network packet level so it shouldn't really be affected by NIO but I wonder if it a slight different configuration in the network stack (non-blocking vs blocking) that causes the different behaviour. I also suspect that this may be related to the change to re-start JNLP slaves using a new JVM rather than just restarting in the same process. It may be that the Windows TCP stack just immediately forgets the old TCP socket when the old JVM process exits and hence ignores retries rather than RSTing the connection. I'm planning on waiting for the current crop of JNLP slave fixes to get integrated and will then experiment a bit further to see if I can work out what is stopping the NIO selector getting notified that the slave has closed the TCP connection.
            Hide
            legolas Arnt Witteveen added a comment -

            Thinking about this some more, I still think it would be nice that a temporary network outage of half a minute or so :

            • does not abort a build (that may have been halfway through a build process of several hours). I wouldn't mind seeing, in the log on the server, a message saying [network was out, log may be missing a part here] in the middle of the log, but why stop the build completely right away? Even in cases where the slave needs the server, there's little point in aborting the build, because the slave is not going to get new work from the server if it can't connect to it anyway?
            • does not cause the client to be offline for that long. If the server receives a new connection attempt of a slave node, and there already is a connection of the same slave node, who not just disconnect the old connection and continue with the new one?
            Show
            legolas Arnt Witteveen added a comment - Thinking about this some more, I still think it would be nice that a temporary network outage of half a minute or so : does not abort a build (that may have been halfway through a build process of several hours). I wouldn't mind seeing, in the log on the server, a message saying [network was out, log may be missing a part here] in the middle of the log, but why stop the build completely right away? Even in cases where the slave needs the server, there's little point in aborting the build, because the slave is not going to get new work from the server if it can't connect to it anyway? does not cause the client to be offline for that long. If the server receives a new connection attempt of a slave node, and there already is a connection of the same slave node, who not just disconnect the old connection and continue with the new one?
            Hide
            oldelvet Richard Mortimer added a comment -

            @Arnt Witteveen I agree it would be nice if it could withstand a network outage of a minute or so but there are a few technical challenges to overcome which make this non-trivial.

            In theory things should survive for a brief network outage and indeed they do on a linux-to-linux master/slave arrangement but windows seems to be much more aggressive in closing down TCP connections and hence the windows side of things has gone before the network connection has been restored.

            I believe that making the low level protocol cope with a break of the TCP connection is non-trivial.

            Show
            oldelvet Richard Mortimer added a comment - @Arnt Witteveen I agree it would be nice if it could withstand a network outage of a minute or so but there are a few technical challenges to overcome which make this non-trivial. In theory things should survive for a brief network outage and indeed they do on a linux-to-linux master/slave arrangement but windows seems to be much more aggressive in closing down TCP connections and hence the windows side of things has gone before the network connection has been restored. I believe that making the low level protocol cope with a break of the TCP connection is non-trivial.
            Hide
            danielbeck Daniel Beck added a comment -

            temporary network outage of half a minute

            What kind of real network has regular outages like that? I'd try to discuss SLA with network admins

            A few workarounds to this issue:

            Jenkins Enterprise by Cloudbees (Jenkins + commercial plugins and support) offers a "long running builds plugin" that can survive disconnects, master restarts, etc:
            http://release-notes.cloudbees.com/product/CloudBees+Long-Running+Build+Plugin

            Something similar will be made available for free as part of the workflow plugin:
            https://github.com/jenkinsci/workflow-plugin

            Show
            danielbeck Daniel Beck added a comment - temporary network outage of half a minute What kind of real network has regular outages like that? I'd try to discuss SLA with network admins A few workarounds to this issue: Jenkins Enterprise by Cloudbees (Jenkins + commercial plugins and support) offers a "long running builds plugin" that can survive disconnects, master restarts, etc: http://release-notes.cloudbees.com/product/CloudBees+Long-Running+Build+Plugin Something similar will be made available for free as part of the workflow plugin: https://github.com/jenkinsci/workflow-plugin
            Hide
            michaelcm Michael M added a comment -

            Just wanted to follow up. We see this issue a few times a week, and are confident that there isn't technically a network outage at all as other nodes and jobs stay connected through this error. It most frequently happens for us at the end of a job, once as far as I'm concerned everything is complete. Can attach more logs of the error if that is helpful.

            Running Jenkins 1.583 on Windows 7.

            Thanks!

            Show
            michaelcm Michael M added a comment - Just wanted to follow up. We see this issue a few times a week, and are confident that there isn't technically a network outage at all as other nodes and jobs stay connected through this error. It most frequently happens for us at the end of a job, once as far as I'm concerned everything is complete. Can attach more logs of the error if that is helpful. Running Jenkins 1.583 on Windows 7. Thanks!
            Hide
            legolas Arnt Witteveen added a comment -

            Meanwhile, we have found another cause of these disconnects, which may help in reproducing it (if you happen to have this setup available) : this happens when the slave is on a VMWare ESXi virtual machine, and you have multiple hosts (physical machines running VMWare ESX) available, and you have set up VMotion with DRS: at the moment the current slave VM host gets overloaded (e.g. because it starts a build!) the load is rebalanced over the hosts, and the slave VM may move to another host. This movement should be transparent but seems to cause this issue (I'm not sure why, it might be a timeout, it might be the fact that the physical network card that receives the traffic changes, or something else still).

            Show
            legolas Arnt Witteveen added a comment - Meanwhile, we have found another cause of these disconnects, which may help in reproducing it (if you happen to have this setup available) : this happens when the slave is on a VMWare ESXi virtual machine, and you have multiple hosts (physical machines running VMWare ESX) available, and you have set up VMotion with DRS: at the moment the current slave VM host gets overloaded (e.g. because it starts a build!) the load is rebalanced over the hosts, and the slave VM may move to another host. This movement should be transparent but seems to cause this issue (I'm not sure why, it might be a timeout, it might be the fact that the physical network card that receives the traffic changes, or something else still).
            Hide
            dserodio Daniel Serodio added a comment -

            Thanks Arnt Witteveen, this is exactly our setup. As a workaround, I've asked our ops team do disable VMotion for these VMs

            Show
            dserodio Daniel Serodio added a comment - Thanks Arnt Witteveen , this is exactly our setup. As a workaround, I've asked our ops team do disable VMotion for these VMs
            Hide
            stephenconnolly Stephen Connolly added a comment -

            By any chance are you provisioning and de-provisioning other unrelated slaves? I ask as a slave disconnect, if incorrectly timed and unlucky enough can cause NioChannelHub to livelock (JENKINS-25218) trying to process the channel termination for the disconnected slave until the ping thread kills that channel and breaks the livelock.

            Disabling the ping threads on the slave and the master should either provide an alternative workaround or confirm that this issue is JENKINS-25218

            Show
            stephenconnolly Stephen Connolly added a comment - By any chance are you provisioning and de-provisioning other unrelated slaves? I ask as a slave disconnect, if incorrectly timed and unlucky enough can cause NioChannelHub to livelock ( JENKINS-25218 ) trying to process the channel termination for the disconnected slave until the ping thread kills that channel and breaks the livelock. Disabling the ping threads on the slave and the master should either provide an alternative workaround or confirm that this issue is JENKINS-25218
            Hide
            legolas Arnt Witteveen added a comment -

            At the time this first occurred, in my case we had only 1 slave, so that means that at least for me that was probably not the cause if I understand correctly?

            Show
            legolas Arnt Witteveen added a comment - At the time this first occurred, in my case we had only 1 slave, so that means that at least for me that was probably not the cause if I understand correctly?
            Hide
            stephenconnolly Stephen Connolly added a comment -

            Arnt Witteveen correct if you have one and only one slave and you are not using one of the cloud providers and you are not using a retention strategy that disconnects when idle then this issue is not JENKINS-25218.

            Show
            stephenconnolly Stephen Connolly added a comment - Arnt Witteveen correct if you have one and only one slave and you are not using one of the cloud providers and you are not using a retention strategy that disconnects when idle then this issue is not JENKINS-25218 .
            Hide
            emangual Ed Mangual added a comment -

            I have the same issues with frequent disconnects with Jenkins Master/Slaves VMs hosted on ESX with VMotion enabled. This is common in IT organizations nowadays. What are we doing to get a permanent solution to this issue?

            Daniel Serodio Can you tell us if disabling VMotion helped?

            In the meantime I will have my ops team disable vmotion as well.

            Show
            emangual Ed Mangual added a comment - I have the same issues with frequent disconnects with Jenkins Master/Slaves VMs hosted on ESX with VMotion enabled. This is common in IT organizations nowadays. What are we doing to get a permanent solution to this issue? Daniel Serodio Can you tell us if disabling VMotion helped? In the meantime I will have my ops team disable vmotion as well.
            Hide
            dserodio Daniel Serodio added a comment -

            Ed Mangual unfortunately, I found out that VMotion was already disabled. I did find out that our corporate firewall closes connections if the connection is idle, but from what I understand of Jenkins remoting I think the "ping thread" should prevent the connection from idling. Still investigating.

            Show
            dserodio Daniel Serodio added a comment - Ed Mangual unfortunately, I found out that VMotion was already disabled. I did find out that our corporate firewall closes connections if the connection is idle, but from what I understand of Jenkins remoting I think the "ping thread" should prevent the connection from idling. Still investigating.
            Hide
            jaykah Jay Kah added a comment -

            We are having the same issue on Jenkins ver. 1.634 on Ubuntu Trusty.

            Show
            jaykah Jay Kah added a comment - We are having the same issue on Jenkins ver. 1.634 on Ubuntu Trusty.
            Hide
            flow86 Florian Doersch added a comment -

            we're having the same issue with 1.638 and "real" (not virtualized) hosts

            Show
            flow86 Florian Doersch added a comment - we're having the same issue with 1.638 and "real" (not virtualized) hosts
            Hide
            moshe_zvi Moshe Zvi added a comment -

            +1 Florian

            Show
            moshe_zvi Moshe Zvi added a comment - +1 Florian
            Hide
            dariogriffo Dario Griffo added a comment -

            Same issue for me. Below some log from the slave.
            I can get small builds to end fine (less than 2 minutes), but we are running automation tests running in this box (fixed job) and takes ~30 minutes.
            This started last week, without any changes in the boxes, no updates, no infrastructure changes.
            The build has been working fine for 2 months.
            Master: Windows 2012 64-bit R2 Datacenter
            Slave: Windows 7 64-bit Enterprise

            Slave log (all look the same)
            INFO: Trying protocol: JNLP2-connect
            Jan 18, 2016 12:49:34 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected
            Jan 18, 2016 12:59:54 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
            SEVERE: I/O error in channel channel
            java.net.SocketException: Software caused connection abort: recv failed
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.socketRead(Unknown Source)
            at java.net.SocketInputStream.read(Unknown Source)
            at java.net.SocketInputStream.read(Unknown Source)
            at java.io.BufferedInputStream.fill(Unknown Source)
            at java.io.BufferedInputStream.read(Unknown Source)
            at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
            at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
            at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
            at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

            Jan 18, 2016 12:59:54 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Terminated
            Jan 18, 2016 1:00:04 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
            INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@c2400d
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up slave: vostro-470-002
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
            INFO: Jenkins agent is running in headless mode.
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Locating server among http://168.0.0.228:7542/
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Handshaking
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connecting to ci.teamhours.com:50081
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Trying protocol: JNLP2-connect
            Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected

            Show
            dariogriffo Dario Griffo added a comment - Same issue for me. Below some log from the slave. I can get small builds to end fine (less than 2 minutes), but we are running automation tests running in this box (fixed job) and takes ~30 minutes. This started last week, without any changes in the boxes, no updates, no infrastructure changes. The build has been working fine for 2 months. Master: Windows 2012 64-bit R2 Datacenter Slave: Windows 7 64-bit Enterprise Slave log (all look the same) INFO: Trying protocol: JNLP2-connect Jan 18, 2016 12:49:34 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected Jan 18, 2016 12:59:54 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel channel java.net.SocketException: Software caused connection abort: recv failed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at java.io.BufferedInputStream.fill(Unknown Source) at java.io.BufferedInputStream.read(Unknown Source) at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82) at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72) at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103) at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) Jan 18, 2016 12:59:54 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Terminated Jan 18, 2016 1:00:04 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@c2400d Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main createEngine INFO: Setting up slave: vostro-470-002 Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener <init> INFO: Jenkins agent is running in headless mode. Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Locating server among http://168.0.0.228:7542/ Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Handshaking Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connecting to ci.teamhours.com:50081 Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Trying protocol: JNLP2-connect Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected
            Hide
            dariogriffo Dario Griffo added a comment -

            Florian Doersch Arnt Witteveen
            I have been running again builds for 4 hours without any disconnection.
            My "fix": stop using the slave windows service and run this as a scheduled task running a bat file.
            First I downloaded the slave.jar and slave-agent.jnlp into an empty folder (C:\Jenkins).
            Created the following .bat in that folder

            cd C:\Jenkins
            java -jar slave.jar -jnlpUrl http:/MY_JENKINS_URL/computer/MY_SLAVE_NAME/slave-agent.jnlp -secret MYSECRET

            That last line is the one in the agent page
            Create a scheduled task to run at startup with same credentials as the uninstall windows slave service.
            So far, no disconnections, when I used to have them every 4~5 minutes.

            Show
            dariogriffo Dario Griffo added a comment - Florian Doersch Arnt Witteveen I have been running again builds for 4 hours without any disconnection. My "fix": stop using the slave windows service and run this as a scheduled task running a bat file. First I downloaded the slave.jar and slave-agent.jnlp into an empty folder (C:\Jenkins). Created the following .bat in that folder cd C:\Jenkins java -jar slave.jar -jnlpUrl http:/MY_JENKINS_URL/computer/MY_SLAVE_NAME/slave-agent.jnlp -secret MYSECRET That last line is the one in the agent page Create a scheduled task to run at startup with same credentials as the uninstall windows slave service. So far, no disconnections, when I used to have them every 4~5 minutes.
            Hide
            jfemia James Femia added a comment - - edited

            We are having the same issue with slaves running on Win10 in VMWare, already running slave.jar as a scheduled task on login so the suggested workaround does not help us

            Show
            jfemia James Femia added a comment - - edited We are having the same issue with slaves running on Win10 in VMWare, already running slave.jar as a scheduled task on login so the suggested workaround does not help us
            Hide
            chr0n1x Kevin R. added a comment -

            bump

            Show
            chr0n1x Kevin R. added a comment - bump
            Hide
            sithmein Thorsten Meinl added a comment -

            Since we switched from freestyle jobs to pipelines (a very decision if you as me), about every second jobs fails with this error which makes Windows slaves pretty much unusable. It has never been a problem before, therefore my guess is that is has something to do with pipelines. Here's a strack trace, which always looks the same:

            java.io.IOException: remote file operation failed: C:\Users\jenkins\slave\workspace\g.knime.product.full_master-6YACVMZPMCFFZKR6NNQZEQAJGOVHWVVS2MSXJGRRFPLNAY2WNRWQ at hudson.remoting.Channel@2eed4a92:Channel to /xxx.xxx.xxx.xxx hudson.remoting.ChannelClosedException: channel is already closed
            	at hudson.FilePath.act(FilePath.java:992)
            	at hudson.FilePath.act(FilePath.java:974)
            	at hudson.FilePath.mkdirs(FilePath.java:1157)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:77)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:65)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49)
            	at hudson.security.ACL.impersonate(ACL.java:221)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            	at java.lang.Thread.run(Thread.java:745)
            Caused by: hudson.remoting.ChannelClosedException: channel is already closed
            	at hudson.remoting.Channel.send(Channel.java:604)
            	at hudson.remoting.Request.call(Request.java:130)
            	at hudson.remoting.Channel.call(Channel.java:821)
            	at hudson.FilePath.act(FilePath.java:985)
            	... 12 more
            Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
            	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
            	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            	... 5 more
            Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
            	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
            	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
            	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
            	... 6 more

            I'm happy to debug this further if someone tells me what to do.

            Show
            sithmein Thorsten Meinl added a comment - Since we switched from freestyle jobs to pipelines (a very decision if you as me), about every second jobs fails with this error which makes Windows slaves pretty much unusable. It has never been a problem before, therefore my guess is that is has something to do with pipelines. Here's a strack trace, which always looks the same: java.io.IOException: remote file operation failed: C:\Users\jenkins\slave\workspace\g.knime.product.full_master-6YACVMZPMCFFZKR6NNQZEQAJGOVHWVVS2MSXJGRRFPLNAY2WNRWQ at hudson.remoting.Channel@2eed4a92:Channel to /xxx.xxx.xxx.xxx hudson.remoting.ChannelClosedException: channel is already closed at hudson.FilePath.act(FilePath.java:992) at hudson.FilePath.act(FilePath.java:974) at hudson.FilePath.mkdirs(FilePath.java:1157) at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:77) at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:65) at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49) at hudson.security.ACL.impersonate(ACL.java:221) at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: hudson.remoting.ChannelClosedException: channel is already closed at hudson.remoting.Channel.send(Channel.java:604) at hudson.remoting.Request.call(Request.java:130) at hudson.remoting.Channel.call(Channel.java:821) at hudson.FilePath.act(FilePath.java:985) ... 12 more Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) ... 5 more Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426) at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332) at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565) ... 6 more I'm happy to debug this further if someone tells me what to do.
            Hide
            dcsobral Daniel Sobral added a comment -

            My maven builds were plagued by this, so it's not pipeline-related. I don't see it much with pipelines, though I wrote my pipelines with retry from the start, because I expected that very problem to persist.

            Show
            dcsobral Daniel Sobral added a comment - My maven builds were plagued by this, so it's not pipeline-related. I don't see it much with pipelines, though I wrote my pipelines with retry from the start, because I expected that very problem to persist.
            Hide
            sithmein Thorsten Meinl added a comment -

            I have a vague feeling that this problem occurs when there is more than one job running on the slave. If I have time, I will try to restrict the Windows slaves to one slot and see if this makes the problem go away. Maybe some of you can try this, too.

            Show
            sithmein Thorsten Meinl added a comment - I have a vague feeling that this problem occurs when there is more than one job running on the slave. If I have time, I will try to restrict the Windows slaves to one slot and see if this makes the problem go away. Maybe some of you can try this, too.
            Hide
            stoiky Mihai Stoichitescu added a comment -

            Thorsten Meinl we encounter this error on machines that only have 1 executor setup.

            Maybe this happens if there are flyweight executors running on the node?

            Show
            stoiky Mihai Stoichitescu added a comment - Thorsten Meinl we encounter this error on machines that only have 1 executor setup. Maybe this happens if there are flyweight executors running on the node?
            Hide
            stuartjsmith Stuart Smith added a comment -

            We have frequent disconnects on two of our slave machines, both are located in geographically the same location, and no other machines in any other location suffer from this, so we are suspecting that it is maybe something firewall related in that location. The infra guys have initially suggested that the firewall is killing idle connections, and have increased the timeout, but we are still seeing frequent disconnections. Is the JNLP connection really idle between slave and master if no builds are running? Our call stack is as follows (Jenkins Master 2.46.1):

            JNLP agent connected from xx Slave.jar version: 3.4.1 This is a Windows agent Agent successfully connected and online ERROR: Connection terminated [8mha:////4EVWKVufSpoBsjG/AK97kvCQst6o1LLM9fjogkB0XVcIAAAAWB+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0ldFVf2c+b/lb5MDAwVRQxSaBqcITRIIQMEMIIUFgAAckCEiWAAAAA=[0mjava.nio.channels.ClosedChannelException at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154) at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179) at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

            Show
            stuartjsmith Stuart Smith added a comment - We have frequent disconnects on two of our slave machines, both are located in geographically the same location, and no other machines in any other location suffer from this, so we are suspecting that it is maybe something firewall related in that location. The infra guys have initially suggested that the firewall is killing idle connections, and have increased the timeout, but we are still seeing frequent disconnections. Is the JNLP connection really idle between slave and master if no builds are running? Our call stack is as follows (Jenkins Master 2.46.1): JNLP agent connected from xx Slave.jar version: 3.4.1 This is a Windows agent Agent successfully connected and online ERROR: Connection terminated [8mha:////4EVWKVufSpoBsjG/AK97kvCQst6o1LLM9fjogkB0XVcIAAAAWB+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0ldFVf2c+b/lb5MDAwVRQxSaBqcITRIIQMEMIIUFgAAckCEiWAAAAA=[0mjava.nio.channels.ClosedChannelException at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154) at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179) at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
            Hide
            stephenconnolly Stephen Connolly added a comment -

            If no builds are running then the connection can be idle though given you are running JNLP4 then the TLS transport layer should be trickling some data periodically... OTOH perhaps that periodic trickle is actually detecting a network issue

            Show
            stephenconnolly Stephen Connolly added a comment - If no builds are running then the connection can be idle though given you are running JNLP4 then the TLS transport layer should be trickling some data periodically... OTOH perhaps that periodic trickle is actually detecting a network issue
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Stuart Smith "RecvClosed()" means that the Tcp session is interrupted by the operating system or by any router between Jenkins master and agent. My first recommendation would be to increase Tcp retransmission timeout in the operating system. If it does not help, then you need to identify which node in your network interrupts the session. It can be done by analyzing Tcp dumps 

            Show
            oleg_nenashev Oleg Nenashev added a comment - Stuart Smith "RecvClosed()" means that the Tcp session is interrupted by the operating system or by any router between Jenkins master and agent. My first recommendation would be to increase Tcp retransmission timeout in the operating system. If it does not help, then you need to identify which node in your network interrupts the session. It can be done by analyzing Tcp dumps 
            Hide
            dserodio Daniel Serodio added a comment -

            Doesn't the "ping thread" prevent the connection from becoming idle?

            Show
            dserodio Daniel Serodio added a comment - Doesn't the " ping thread " prevent the connection from becoming idle?
            Hide
            heiko_nardmann Heiko Nardmann added a comment -

            I also get this problem. Just one slave is affected (Windows 10) and this slave is running two agents based on separate folders. One agent is intended to be used to provide a build slave, the other one is used for providing a smoke test slave (which does not need a high performance).

            Show
            heiko_nardmann Heiko Nardmann added a comment - I also get this problem. Just one slave is affected (Windows 10) and this slave is running two agents based on separate folders. One agent is intended to be used to provide a build slave, the other one is used for providing a smoke test slave (which does not need a high performance).

              People

              • Assignee:
                Unassigned
                Reporter:
                legolas Arnt Witteveen
              • Votes:
                41 Vote for this issue
                Watchers:
                52 Start watching this issue

                Dates

                • Created:
                  Updated: