Jenkins / JENKINS-28492

The server rejected the connection: *** is already connected to this master. Rejecting this connection.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Component/s: core, remoting
    • Labels:
      None
    • Environment:
      Windows 7, Jenkins 1.613

      Description

      I am running Jenkins 1.613.

      I see the following in the slave's log:
      Apr 15, 2015 10:15:02 AM com.youdevise.hudson.slavestatus.SlaveListener call
      INFO: Slave-status listener starting
      Apr 15, 2015 10:15:02 AM com.youdevise.hudson.slavestatus.SocketHTTPListener waitForConnection
      INFO: Slave-status listener ready on port 3141
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: 192.168.161.8
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://192.168.95.37:8080/jenkins/
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to 192.168.95.37:19994
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Apr 15, 2015 1:41:01 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: 192.168.161.8 is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: 192.168.161.8 is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
      at hudson.remoting.Engine.run(Engine.java:276)

        Attachments

        1. channelclosedstack.txt
          1.30 MB
        2. jenkins-slave.err.log
          293 kB
        3. Slave errors
          2 kB

          Issue Links

            Activity

            sahilsk sonu kumar added a comment -

            Any help here would be great.

            slide_o_mix Alex Earl added a comment -

            Can you verify that you don't have another instance of the slave.jar running on your slave computer? Either that, or you may have two computers with the same IP address.

            danielbeck Daniel Beck added a comment -

            I think the "IP address" here may be the slave name? If so, there may also be two machines claiming to be the same slave.

            dwooster Douglas Wooster added a comment -

            This issue is reporting the same symptoms as JENKINS-5055, which was thought to have been fixed back before the Hudson/Jenkins split, but now has multiple people reporting that the problem is back.

            dwooster Douglas Wooster added a comment - - edited

            I am seeing the exact same thing with Jenkins 1.574, after reboot of the slave.
            We download the slave.jar from the master before every startup so we know it is current.
            Both slaves and master are RHEL 6.4 Linux.
            Slaves are started headless from an init script when the machine is booted.
            Error occurs about 1/3 of the time.

            mmitche Matthew Mitchell added a comment - - edited

            I'm also seeing this on Jenkins 1.642.3. In addition, if the connection is explicitly stopped on the slave side, the master still shows the slave as connected.

            mmitche Matthew Mitchell added a comment - - edited

            Also, we didn't see this at all before, but there appears to be a large uptick in instances since moving toward more cloud-allocated machines (azure-slave-plugin), though the Azure slave plugin machines never see the issue themselves. I have also never seen it on SSH-connected machines.

            mmitche Matthew Mitchell added a comment -

            I got a full thread dump (channelclosedstack.txt) while this was happening and am investigating, since for whatever reason this is hitting us full bore right now.

            mmitche Matthew Mitchell added a comment -

            Douglas, do you know anything more about the installations where we are seeing this? I think I am seeing a pattern in my repros that has to do with slave deletion (and maybe creation). I think for us this correlates with increased usage of the Azure cloud plugin.

            mgaborit Martin Gaborit added a comment - - edited

            We are seeing the same issue with Jenkins 1.634.
            The master is a Debian Jessie and the slave is a Windows Server 2012 R2 connected via JNLP.
            After rebooting the Windows slave, it tries to connect to the master exactly 1030 times (we reproduced this a few times). On attempt 1031, it connects correctly.

            In addition, as Matthew Mitchell said, the master doesn't seem to be aware that the slave was disconnected.

            dwooster Douglas Wooster added a comment - - edited

            Hi, Matthew,
            Sorry I took so long - Jenkins-ci.org authentication server was down when I tried before.

            Jenkins server is 1.574 running on Tomcat 7.0.63 in a virtual machine running RHEL 6.4 Linux. Slaves are also on RHEL 6.4. Mix of virtual machines and real physical machines. No cloud involved. All slaves run headless and are started by a script on the slave machines (script can be run manually or automatically during boot). I think we only see this problem when the machine the slave is running on is rebooted.

            We do NOT use azure-cloud-plugin. All of the slaves currently in use are on Linux. We have one Windows slave defined, but we have not attempted to run that slave for several months.

            We have been running Jenkins 1.574 for about two years. The problem was first noticed when we added slave startup to boot autostart scripts in late 2015.

            mmitche Matthew Mitchell added a comment -

            So I think what's going on is that certain kinds of network outages (or maybe a reboot) prevent the master from realizing that the connection is closed.

            I think I see one place where there is a bug:

            https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/slaves/ChannelPinger.java#L111

            I've seen the call into the PingFailureAnalyzer throw an NPE, causing us to miss the call to terminate the channel. However, unless I'm misunderstanding the code, I think that's executing on the client, which wouldn't cause the server to keep the channel open.

            Do you have any special ping setting set on Jenkins startup?

            danielbeck Daniel Beck added a comment -

            Oleg Nenashev Maybe worth it to take a look at this?

            dwooster Douglas Wooster added a comment -

            I'm not aware of any special ping setting on Master startup. How would I check?

            There are no special operands on the slave startup; just -jnlpUrl pointing at the Master.

            Looking at ChannelPinger.java, I think it executes on BOTH the slave and the master. See lines 88-104.

            mmitche Matthew Mitchell added a comment -

            The Launcher code in remoting (the part that has all the client-side options) also launches a pinger. So I think there is a pinger on both sides, but not the exact same code.

            mmitche Matthew Mitchell added a comment -

            I'm trying to set up a fake repro for this where the client will always fail and attempt to close the channel to see whether I can get the master to have the same issue seen in the wild.

            mmitche Matthew Mitchell added a comment -

            Alright, I was unable to get an exact repro, but I think I saw enough, combined with the logs, to figure out what happens:

            1) The PingFailureAnalyzer call in onDead in ChannelPinger.java will occasionally fail for various reasons: NPEs, findClass failures. Not always, but sometimes. When this happens, the master will NOT call close on the channel.
            2) Normally this appears to be fine. The channel is somehow replaced or nulled out through other means. I think in cases of process termination or other semi-orderly shutdowns the socket connection is notified.
            3) However, suppose there is a network anomaly: a power outage, a network cable unplugged, a connection dropped somewhere other than the server/client side. In this case, the master will notice, potentially via a failed ping. If it then fails in the PingFailureAnalyzer code, it won't close the channel.
            4) The slave comes back, say from a reboot, or the network cable is plugged back in, and attempts to reconnect. The channel is not null, and we get the error.

            I think the keys to the repro are the lack of "Terminate" text in the slave log and the definite issue of not closing the channel when an exception is seen in the PFA. The lack of "Terminate" indicates there was not an orderly shutdown of the channel on the client side.

            So, the fix would be to wrap the call to the PFA in a try/catch to ensure that channel.close() is in fact called.

            The issues I was seeing in my installation have subsided, but this fix was made as they were tailing off, and I don't think I saw any "already connected" errors after that.

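            The proposed fix can be sketched as below. The interfaces here are stand-ins, not the real hudson.remoting / hudson.slaves classes; the sketch only shows the shape of the change — wrapping the PingFailureAnalyzer calls so that channel.close() is always reached even when an analyzer throws:

```java
import java.io.IOException;
import java.util.List;

public class PingFailureSketch {
    // Stand-in interfaces for this demo, not the real remoting API.
    interface Channel { void close() throws IOException; }
    interface PingFailureAnalyzer { void onPingFailure(Channel ch, Throwable cause) throws IOException; }

    static boolean closed = false;  // records whether close() ran, for the demo

    // Analogous to ChannelPinger's onDead handler: consult the analyzers,
    // but never let an analyzer failure prevent channel cleanup.
    static void onDead(Channel channel, List<PingFailureAnalyzer> analyzers, Throwable cause) {
        try {
            for (PingFailureAnalyzer pfa : analyzers) {
                pfa.onPingFailure(channel, cause);  // may throw NPE etc.
            }
        } catch (Throwable t) {
            // analyzer failed; fall through to cleanup anyway
        } finally {
            try {
                channel.close();  // always reached now
            } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) {
        Channel ch = () -> closed = true;
        PingFailureAnalyzer broken = (c, cause) -> { throw new NullPointerException(); };
        onDead(ch, List.of(broken), new RuntimeException("ping timeout"));
        System.out.println("channel closed: " + closed);  // prints "channel closed: true"
    }
}
```

            Without the try/catch/finally, the NullPointerException from the analyzer would propagate and the close() call would be skipped — which is the stale-channel scenario described above.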
            lil2006 Igor Tereshchuk added a comment - - edited

            Hi,
            I ran into the same issue on Windows 2012 R2 with slave.jar version 2.49.
            Master Log:
            <===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.49
            This is a Windows slave
            Slave successfully connected and online
            ERROR: Connection terminated
            java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@186a19a4[name=*****]
            at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211)
            at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631)
            at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
            at java.util.concurrent.FutureTask.run(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.lang.Thread.run(Unknown Source)
            Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
            at sun.nio.ch.SocketDispatcher.read0(Native Method)
            at sun.nio.ch.SocketDispatcher.read(Unknown Source)
            at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
            at sun.nio.ch.IOUtil.read(Unknown Source)
            at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
            at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
            at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
            at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:564)
            ... 6 more

            Slave log:
            Apr 07, 2016 9:41:00 AM hudson.remoting.jnlp.Main$CuiListener error
            SEVERE: The server rejected the connection: <VMNAME> is already connected to this master. Rejecting this connection.
            java.lang.Exception: The server rejected the connection: <VMNAME> is already connected to this master. Rejecting this connection.
            at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
            at hudson.remoting.Engine.run(Engine.java:276)

            jenkins-slave.wrapper log:
            A new process starts every second.

            In the system event log:
            The Jenkins Slave service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.

            I see this log every second for about a minute, then the connection is lost and comes back online immediately.

            lil2006 Igor Tereshchuk added a comment -

            In my case the root cause may be that the client reboots too quickly and the master can't detect that the slave was shut down; as a result, after the reboot the client tries to start a connection while the master still sees the client as online.
            I added a sleep before starting the service so that the master can determine the slave is already offline. We'll see whether it helps.

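            That sleep-before-start workaround could be sketched as below. This is a hypothetical helper, not anything shipped with Jenkins; the four-minute figure is an assumption based on the default master-side ping timeout, and the actual launch command is commented out:

```java
// Hypothetical sketch of the "sleep before starting the service" workaround.
// Idea: wait longer than the master's ping timeout so the stale channel is
// dropped before the slave reconnects. The 4-minute timeout and 30 s margin
// are assumptions, not Jenkins settings read from anywhere.
public class DelayedAgentStart {
    static final long PING_TIMEOUT_MS = 240_000;  // assumed master ping timeout (4 min)
    static final long MARGIN_MS = 30_000;         // safety margin

    static long startupDelayMs() {
        return PING_TIMEOUT_MS + MARGIN_MS;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Sleeping " + startupDelayMs() + " ms before launching slave.jar");
        // Thread.sleep(startupDelayMs());  // commented out so the sketch runs instantly
        // new ProcessBuilder("java", "-jar", "slave.jar", "-jnlpUrl", "http://master/...")
        //         .inheritIO().start();
    }
}
```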
            pjohnsonic509 Phil Johnson added a comment - - edited

            Hello. I'm getting a similar issue.
            Server Log:

            May 11, 2016 2:39:30 PM hudson.slaves.ChannelPinger$1 onDead
            INFO: Ping failed. Terminating the channel build7.
            java.util.concurrent.TimeoutException: Ping started at 1462995330113 hasn't completed by 1462995570113
            Seems (although certainly not definitively) network-related?

            Client Log:
            2016-05-11 14:39:45 - Stopping jenkinsslave-D__Jenkins
            2016-05-11 14:39:45 - ProcessKill 3608
            2016-05-11 14:39:45 - Found child process: 4432 Name: conhost.exe
            2016-05-11 14:39:45 - Stopping process 4432
            2016-05-11 14:39:45 - Send SIGINT 4432
            2016-05-11 14:39:45 - SIGINT to 4432 failed - Killing as fallback
            2016-05-11 14:39:45 - Stopping process 3608
            2016-05-11 14:39:45 - Send SIGINT 3608
            2016-05-11 14:39:45 - SIGINT to 3608 failed - Killing as fallback
            2016-05-11 14:39:45 - Finished jenkinsslave-D__Jenkins
            2016-05-11 14:39:46 - Starting C:\Program Files\Java\jre7\bin\java.exe -Xrs -jar "D:\Jenkins\slave.jar" -jnlpUrl <server>
            2016-05-11 14:39:46 - Started 2348

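            For what it's worth, the two epoch timestamps in that ping failure are exactly 240 s apart, which matches a four-minute ping timeout (an observation about the log, not a fix):

```java
// Quick arithmetic on the timestamps from the ping-failure log above:
// "Ping started at 1462995330113 hasn't completed by 1462995570113".
public class PingGap {
    static long gapMs() {
        long started = 1_462_995_330_113L;
        long deadline = 1_462_995_570_113L;
        return deadline - started;  // 240,000 ms = 4 minutes
    }

    public static void main(String[] args) {
        System.out.println(gapMs() + " ms = " + (gapMs() / 1000 / 60) + " minutes");
    }
}
```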
            cheecheeo J C added a comment -

            Also seeing this issue:

            May 16, 2016 10:47:09 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
            INFO: Accepted connection #366831 from /10.180.20.162:57952
            May 16, 2016 10:47:09 PM jenkins.slaves.JnlpSlaveHandshake error
            WARNING: TCP slave agent connection handler #366831 with /10.180.20.162:57952 is aborted: ci-windows4 is already connected to this master. Rejecting this connection.
            May 16, 2016 10:47:09 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
            INFO: Accepted connection #366832 from /10.180.20.162:57953
            May 16, 2016 10:47:09 PM jenkins.slaves.JnlpSlaveHandshake error
            WARNING: TCP slave agent connection handler #366832 with /10.180.20.162:57953 is aborted: ci-windows4 is already connected to this master. Rejecting this connection.
            May 16, 2016 10:47:09 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
            INFO: Accepted connection #366833 from /10.180.20.162:57954
            May 16, 2016 10:47:09 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
            WARNING: Connection #366833 failed
            java.io.IOException: Connection reset by peer
                    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
                    at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:192)
                    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
                    at java.io.InputStream.read(InputStream.java:101)
                    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
                    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:337)
                    at java.io.DataInputStream.readUTF(DataInputStream.java:589)
                    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
                    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:150)
            
            krogan mark mann added a comment -

            windows 2008R2 master and slaves
            master on Jenkins ver. 1.651.1
            Java JRE is mixture of 8u60, 8u77, 8u91

            I have >200 slaves and started seeing some of my "controller" slaves not wanting to start their Jenkins slaves, especially when the slave has been reinstalled (as a refresh).
            The sample below is from a fresh slave on 8u91. When I try to start the Windows service hosting the slave, it stops immediately with the following log trace.

            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up slave: ComputerA
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener <init>
            INFO: Jenkins agent is running in headless mode.
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Locating server among http://10.20.1.181:8080/jenkins/, http://jenkins:8080/jenkins/
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Handshaking
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connecting to 10.20.1.181:55945
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Trying protocol: JNLP2-connect
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Server didn't understand the protocol: ComputerA is already connected to this master. Rejecting this connection.
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connecting to 10.20.1.181:55945
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Trying protocol: JNLP-connect
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Server didn't understand the protocol: ComputerA is already connected to this master. Rejecting this connection.
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connecting to 10.20.1.181:55945
            May 17, 2016 5:45:58 PM hudson.remoting.jnlp.Main$CuiListener error
            SEVERE: The server rejected the connection: None of the protocols were accepted
            java.lang.Exception: The server rejected the connection: None of the protocols were accepted
            at hudson.remoting.Engine.onConnectionRejected(Engine.java:297)
            at hudson.remoting.Engine.run(Engine.java:268)

            I've set the slave to "disconnected" on the master and then restarted the service. Sometimes it then stays up, but not for long!

            Oleg Nenashev added a comment - edited

            Added to my analysis queue (no time commitment).
            Issue with runaway processes in Windows services should be fixed by JENKINS-39231

            Matthew Mitchell added a comment -

            So I think the "already connected" error may be caused by a cycle in the closing logic, though I don't know why it keeps cropping up. It looks like the channel pinger could trigger a channel close. Under JNLP remoting, we set up a channel with an onClose listener, which itself tries to close the channel under certain circumstances. These listeners are run in the channel termination finally logic, so it's possible that we try to close while we are already inside the close operation.

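            The re-entrant close hazard described above can be illustrated with a small sketch. This is not Jenkins remoting code — the class and field names (GuardedChannel, closing) are hypothetical — it only shows how a guard flag lets the termination logic survive a listener that calls close() again from inside the finally block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a close listener that itself calls close()
// re-enters the termination logic unless a guard flag absorbs it.
class GuardedChannel {
    private final AtomicBoolean closing = new AtomicBoolean(false);
    private final List<Runnable> onCloseListeners = new ArrayList<>();
    int closeCount = 0;

    void addOnCloseListener(Runnable l) { onCloseListeners.add(l); }

    void close() {
        // compareAndSet ensures the termination logic runs at most once,
        // even if a listener calls close() again from the finally block
        if (!closing.compareAndSet(false, true)) {
            return; // already closing: ignore the re-entrant call
        }
        try {
            // ... release transport resources here ...
            closeCount++;
        } finally {
            for (Runnable l : onCloseListeners) {
                l.run(); // may call close() again; the guard absorbs it
            }
        }
    }
}

public class ReentrantCloseDemo {
    public static void main(String[] args) {
        GuardedChannel ch = new GuardedChannel();
        // Listener mirroring the pinger behaviour: close the channel on death
        ch.addOnCloseListener(ch::close);
        ch.close();
        System.out.println("closeCount=" + ch.closeCount); // prints closeCount=1
    }
}
```

            Without the compareAndSet guard, the listener's nested close() would re-run the termination logic (or recurse), which is the kind of cycle suspected here.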
            Matthew Mitchell added a comment -

            hudson.remoting.ChannelClosedException: channel is already closed
            at hudson.remoting.Channel.send(Channel.java:578)
            at hudson.remoting.Request.callAsync(Request.java:205)
            at hudson.remoting.Channel.callAsync(Channel.java:807)
            at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:75)
            at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
            Caused by: java.io.IOException
            at hudson.remoting.Channel.close(Channel.java:1163)
            at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:123)
            at hudson.remoting.PingThread.ping(PingThread.java:126)
            at hudson.remoting.PingThread.run(PingThread.java:85)
            Caused by: java.util.concurrent.TimeoutException: Ping started at 1479165919577 hasn't completed by 1479166819577
            ... 2 more

            Matthew Mitchell added a comment -

            Some more info.

            While I think "already connected to this master" is a real bug, the way it's triggered might be considered expected. I see it when the number of threads in the system suddenly spikes, which seems to cause cascading failures (perhaps "Unable to allocate new native thread"?). In our instance, this might be caused by the Build Failure Analyzer, which can use a lot of threads.

            I do know some other things:

            1) Raising the Java heap size reduces the memory available for native threads and seems to shorten the time to reproduce.
            2) Allocating many new threads very quickly seems to trigger this.

            Oleg Nenashev added a comment -

            For runaway processes in WinSW there is an open pull-request: https://github.com/kohsuke/winsw/pull/133
            But I agree with Matthew Mitchell that it is not the only cause.

            We also have some cases with channel hanging due to deadlocks, which have been likely fixed by https://github.com/jenkinsci/remoting/pull/100

            Matthew Mitchell added a comment -

            I added a bunch more logging to the system to see why the channel would not get cleared out. I've got another couple pieces of info:

            • For me, there seems to be some relation to Job DSL execution. Not exactly sure why.
            • There's definitely something memory-related: we see a big jump in memory usage when this happens, and there are certainly some allocation failures. However, overall heap usage is actually quite low compared to the available size, so it might be the GC reporting the allocation failure.
            • When the error happens, I see the following sequence: 1. timeout failures monitoring clock difference, disk space, etc. on the node; 2. a ping failure on the node; 3. "already connected to this master".

            Now, some other things to note. I added logging to the listeners on the channel set up that would close the channel and null out the reference on the Computer object (onClosed). However, for the affected nodes, the onClosed methods are never executed, so the listeners never get run.

            Why that is, is the big mystery.

            Matthew Mitchell added a comment -

            Also, this issue looks extremely close to what I am seeing. Perhaps there is a lock on the channel object?

            https://issues.jenkins-ci.org/browse/JENKINS-5413

            Matthew Mitchell added a comment -

            FYI when this problem occurs the number of threads starts to climb from steady state at ~800-1500 up to 11k. Most of the extra threads look like they are doing nothing (just a single frame on a thread named Thread-NNNNNN).

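            A thread spike like the one described above can be watched from a monitoring hook using the standard JMX ThreadMXBean API. This is a generic sketch, not Jenkins code; the 5,000 warning threshold is an illustrative value chosen well below the ~11k runaway level quoted in the comment.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: report live and peak JVM thread counts so a runaway
// thread spike is visible before native-thread allocation fails.
public class ThreadSpikeCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int live = threads.getThreadCount();      // currently live threads
        int peak = threads.getPeakThreadCount();  // high-water mark since JVM start
        System.out.println("live=" + live + " peak=" + peak);
        // Illustrative threshold: alert well before the ~11k observed here
        if (live > 5_000) {
            System.out.println("WARNING: thread count spiking");
        }
    }
}
```

            Run periodically (or pasted into the Jenkins script console in Groovy form), this makes the climb from the ~800-1500 steady state visible long before "unable to allocate new native thread" appears.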
            mariem baccar added a comment -

            Below are some explanations of my case, which may help:
            -Before installing the "Docker Slave v1.0.5" plugin, the slave operated correctly without any problems.
            -After installing this plugin, I encountered many problems. One of them concerns the slave: I always get the message "slave agent is already connected". For more details, see the attached full message.
            In fact, the problem is related to JENKINS-39078: there is a bug in Docker Slave Plugin 1.0.5 (fix: https://github.com/jenkinsci/docker-slaves-plugin/commit/451929125fd8ff39c6f84c30476c26cccb912140)
            After discovering that this recently installed plugin was the source of all my new problems in Jenkins 2.19.1 LTS, I uninstalled it and all my problems were resolved.

            Matthew Mitchell added a comment -

            Do you use Job DSL too? I think there might be something related there as well.

            Based on the data seen, I think this may be a deadlock.

            Oleg Nenashev added a comment - edited

            Another known case where the issue happens is an OutOfMemory exception - JENKINS-30823

            Matthew Mitchell added a comment -

            Yep, that correlates with what I've seen too. What is odd is that the actual usage doesn't appear that large. A heap dump before the failure doesn't show a heap that is anywhere close to full (3 GB used out of 32 GB). But I do see lots of "unable to allocate new native thread", etc., and the number of threads grows very large from the steady state.

            Could this be related to a deadlock of some sort where we start to spawn threads till we die?

            Oleg Nenashev added a comment -

            Maybe. I think we first need to implement correct handling of Errors and RuntimeExceptions in the core, to avoid the case where the channel object leaks after a failure. That should help with some cases.

            Matthew Mitchell added a comment -

            Right, for sure. I think there are a few cases:

            1) Cases where plugins/core use computer.getChannel(). These work fine: if the listeners run properly, the channel is set to null and the code can do the right thing at that point, depending on its function.
            2) Cases where the channel object is held on an object (Ping thread, etc.) but there is error handling. These appear okay (ping thread exits).
            3) Cases where the channel object is held on an object but no error handling - These need fixing.

            I don't know of any real cases of #3. #2 appears okay. #1 is a problem because the listeners which should null out the channel on the computer object don't actually run. That is why we see the "already connected" issue.

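            Case #1 above relies on callers null-checking the channel after the close listeners have run. A minimal sketch of that pattern follows; the names (Agent, useAgent, onClosed) are illustrative stand-ins, not Jenkins core API.

```java
// Hypothetical sketch of case #1: callers of getChannel() can cope with
// a dead agent only if the close listeners actually null out the field.
class Agent {
    private volatile Object channel = new Object(); // stands in for hudson.remoting.Channel

    Object getChannel() { return channel; }

    // What the onClosed listener is supposed to do. If it never runs,
    // getChannel() keeps returning the dead channel and a reconnecting
    // agent is rejected as "already connected to this master".
    void onClosed() { channel = null; }
}

public class ChannelNullCheckDemo {
    static String useAgent(Agent a) {
        Object ch = a.getChannel();
        if (ch == null) {
            return "offline"; // safe path, taken only when the listeners ran
        }
        return "online";
    }

    public static void main(String[] args) {
        Agent a = new Agent();
        System.out.println(useAgent(a)); // prints online
        a.onClosed();
        System.out.println(useAgent(a)); // prints offline
    }
}
```

            The bug described in this thread is exactly the case where onClosed never fires, so every caller following this pattern keeps seeing the stale non-null channel.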
            Matthew Mitchell added a comment -

            This also appears to happen "near" a failure:

            https://issues.jenkins-ci.org/browse/JENKINS-33358

            Oleg Nenashev added a comment -

            Some bits have been addressed in JENKINS-39835

            Oleg Nenashev added a comment -

            Fixed in 2.50


              People

              • Assignee:
                oleg_nenashev Oleg Nenashev
                Reporter:
                gaffey gaffey he
              • Votes:
                9
                Watchers:
                17

                Dates

                • Created:
                  Updated:
                  Resolved: