JENKINS-22932

Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

      Description

      When using a Windows Jenkins slave with an OS X master (with the slave set up according to https://wiki.jenkins-ci.org/display/JENKINS/Step+by+step+guide+to+set+up+master+and+slave+machines), disconnecting either from the slave side or from the master (by selecting 'disconnect' from Nodes > NodeName) leaves the slave unable to reconnect until the master Jenkins is restarted, and an error is shown in the node information. This is extremely inconvenient, as it means that the slave machine must be accessed every time the connection is interrupted (e.g. by a restart of Jenkins or of the master machine). The following stack trace is seen on disconnect:

      Connection was broken

      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
      at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.nio.channels.ClosedChannelException
      at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:663)
      at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
      at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
      at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
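The root cause in the trace above is a `ClosedChannelException` raised from `shutdownInput`, which suggests the abort path tried to shut down a socket channel that was already closed. The sketch below reproduces only that JDK behaviour in isolation. It is not Jenkins remoting code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; not part of Jenkins remoting.
class ClosedChannelDemo {

    // Connects a client to a throwaway loopback server, closes the client,
    // then reports whether shutdownInput() rejects the closed channel.
    static boolean shutdownAfterCloseFails() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            client.close();
            try {
                client.shutdownInput();
                return false; // not expected: the channel is already closed
            } catch (ClosedChannelException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdownInput after close fails: " + shutdownAfterCloseFails());
    }
}
```

If this reading is right, the "Failed to abort" message reports a cleanup step that ran after the channel had already been closed. That alone would not explain why later reconnects are refused.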

            Activity

            hangdong Hang Dong added a comment -

            Seeing this on a Windows master with 1.620. When adding a new node, we typically connect via the JNLP link, then install the slave as a service. We hit the issue when the service client reconnects. Perhaps this helps: because the master is secured with HTTPS, the first service connect won't have valid cert info (and we suspect this triggers the issue on the master side). We update the XML with the certificate info, then stop and restart the service, but at this stage the master is already in a bad state: not only can the new slave not reconnect, the master actually loses its connection to all other slaves as well. Our workaround so far is restarting the master.

            10:17:07 java.io.IOException: remote file operation failed: C:\JSBuilds\workspace****************** at hudson.remoting.Channel@1530a3e:********: hudson.remoting.ChannelClosedException: channel is already closed
            10:17:07 at hudson.FilePath.act(FilePath.java:987)
            10:17:07 at hudson.FilePath.act(FilePath.java:969)
            10:17:07 at hudson.FilePath.mkdirs(FilePath.java:1152)
            10:17:07 at hudson.model.AbstractProject.checkout(AbstractProject.java:1275)
            10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
            10:17:07 at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
            10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
            10:17:07 at hudson.model.Run.execute(Run.java:1741)
            10:17:07 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
            10:17:07 at hudson.model.ResourceController.execute(ResourceController.java:98)
            10:17:07 at hudson.model.Executor.run(Executor.java:381)
            10:17:07 Caused by: hudson.remoting.ChannelClosedException: channel is already closed
            10:17:07 at hudson.remoting.Channel.send(Channel.java:550)
            10:17:07 at hudson.remoting.Request.call(Request.java:129)
            10:17:07 at hudson.remoting.Channel.call(Channel.java:752)
            10:17:07 at hudson.FilePath.act(FilePath.java:980)
            10:17:07 ... 10 more
            10:17:07 Caused by: java.io.IOException
            10:17:07 at hudson.remoting.Channel.close(Channel.java:1110)
            10:17:07 at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
            10:17:07 at hudson.remoting.PingThread.ping(PingThread.java:126)
            10:17:07 at hudson.remoting.PingThread.run(PingThread.java:85)
            10:17:07 Caused by: java.util.concurrent.TimeoutException: Ping started at 1441990735275 hasn't completed by 1441990975286
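The final `TimeoutException` carries two timestamps. Assuming both are epoch milliseconds (the values fall in September 2015, which fits the Jenkins version mentioned), their difference shows how long the ping went unanswered before the channel was closed. A minimal sketch of that arithmetic follows; the helper class is hypothetical and not part of Jenkins.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the timeout line; not part of Jenkins.
class PingGap {
    private static final Pattern PING =
            Pattern.compile("Ping started at (\\d+) hasn't completed by (\\d+)");

    // Returns the milliseconds between the two timestamps in the message.
    static long elapsedMillis(String message) {
        Matcher m = PING.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("not a ping timeout message: " + message);
        }
        return Long.parseLong(m.group(2)) - Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        String line = "Ping started at 1441990735275 hasn't completed by 1441990975286";
        System.out.println(elapsedMillis(line) + " ms"); // prints "240011 ms"
    }
}
```

The gap is 240011 ms, or roughly four minutes. This suggests the build failed because a ping timeout closed the channel, not because of a file system error on the slave.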

            spatel Shesh Patel added a comment -

            I encountered this issue after upgrading Jenkins to 1.622. I get the following error while connecting to a Windows slave. I am using the "Launch slave agents via Java Web Start" option to launch the slave. It used to work fine in the previous version, 1.597. The bug seems to have been reintroduced; please follow up with the suggested fix.

            java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@7029f3e3[name=windows_02]
            	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
            	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
            	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            	at java.lang.Thread.run(Thread.java:745)
            Caused by: java.io.IOException: Connection reset by peer
            	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
            	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
            	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
            	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
            	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
            	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
            	at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
            	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
            
            bbl5660 Brian L added a comment - - edited

            This is affecting me as well.

            Master: Jenkins ver. 1.638, Ubuntu 14.04.3 LTS, running JRE 1.8.0_65-b17
            Slave: Windows Server 2008, connected via JNLP:

            
                Microsoft Windows [Version 6.1.7601]
                Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
                
                C:\Users\Administrator>java -version
                java version "1.8.0_31"
                Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
                Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
            
            
            

            Do we have a workaround? I wonder if adding some Job configuration to programmatically kill the process running java ... -jar "...\slave.jar" might work?

            bbl5660 Brian L added a comment -

            I didn't have much luck with an actual patch, but in the meantime, here's the workaround I'm attempting to implement:

            1. Install the Groovy plugin
            2. Use this code as its own job:

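            import jenkins.model.*

            // Workaround for JENKINS-22932: restart the master so that slaves can reconnect.
            println "The system is now going down for restart."
            println "Once the bug 'https://issues.jenkins-ci.org/browse/JENKINS-22932' is resolved, this job should be removed."

            // Safe restart: Jenkins waits for running builds to finish before restarting.
            Jenkins.instance.doSafeRestart(null)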
            

            3. Have the job triggered after any of your Windows slaves finish doing work

            oleg_nenashev Oleg Nenashev added a comment -

            Unfortunately, I have no capacity to work on Remoting in the medium term, so I will unassign this issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


              People

              • Assignee: Unassigned
              • Reporter: dcr dc r
              • Votes: 36
              • Watchers: 57
