JENKINS-33287

JNLP slave agent dies after a timeout is detected on the slave side

      Description

      When a JNLP slave detects a ping timeout, it tries to reconnect. But if the master has not noticed the timeout yet, it rejects the new connection from the slave. The JNLP slave agent process aborts once the connection is rejected in this way.

      STDOUT of JNLP process:

      INFO: Ping failed. Terminating the channel.
      java.util.concurrent.TimeoutException: Ping started on 1456918109582 hasn't completed at 1456918349582
      	at hudson.remoting.PingThread.ping(PingThread.java:125)
      	at hudson.remoting.PingThread.run(PingThread.java:86)
      
      Mar 02, 2016 6:32:29 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel channel
      java.net.SocketException: Socket closed
      	at java.net.SocketInputStream.read(SocketInputStream.java:190)
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
      	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
      	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      Mar 02, 2016 6:32:29 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Mar 02, 2016 6:32:39 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
      INFO: Restarting slave via jenkins.slaves.restarter.UnixSlaveRestarter@6523ff4a
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: dev127-virt2
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.acme.com/hudson/, http://hudson.acme.com/hudson/]
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.acme.com:37003
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Mar 02, 2016 6:32:42 AM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: dev127-virt2 is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: dev127-virt2 is already connected to this master. Rejecting this connection.
      	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
      	at hudson.remoting.Engine.run(Engine.java:276)
      

      Slave log on master:

      JNLP agent connected from /10.16.180.145
      <===[JENKINS REMOTING CAPACITY]===>ERROR: Connection terminated
      Connection terminated
      java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@131dbee3[name=dev127-virt2]
      	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211)
      	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      	at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Broken pipe
      	at sun.nio.ch.FileDispatcher.write0(Native Method)
      	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
      	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)
      	at sun.nio.ch.IOUtil.write(IOUtil.java:40)
      	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:336)
      	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.send(FifoBuffer.java:130)
      	at org.jenkinsci.remoting.nio.FifoBuffer.send(FifoBuffer.java:254)
      	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:622)
      	... 7 more
      Slave.jar version: 2.47
      This is a Unix slave
      Slave successfully connected and online
      Connection terminated
      

      Master log:

      2016-03-02 06:32:38,352 WARNING [hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor] (Monitoring thread for Clock Difference started on Wed Mar 02 06:32:08 EST 2016) Failed to monitor dev127-virt2 for Clock Difference
      java.util.concurrent.TimeoutException
        at hudson.remoting.Request$1.get(Request.java:271)
        at hudson.remoting.Request$1.get(Request.java:206)
        at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
        at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:97)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)
      
      ... (All monitors time out)
      
      2016-03-02 06:32:42,860 INFO  [hudson.TcpSlaveAgentListener] (TCP slave agent connection handler #41773 with /10.16.180.145:58248) Accepted connection #41773 from /10.16.180.145:58248
      2016-03-02 06:32:42,865 WARNING [jenkins.slaves.JnlpSlaveHandshake] (TCP slave agent connection handler #41773 with /10.16.180.145:58248) TCP slave agent connection handler #41773 with /10.16.180.145:58248 is aborted: dev127-virt2 is already connected to this master. Rejecting this connection.
      2016-03-02 06:32:42,866 WARNING [jenkins.slaves.JnlpSlaveHandshake] (TCP slave agent connection handler #41773 with /10.16.180.145:58248) TCP slave agent connection handler #41773 with /10.16.180.145:58248 is aborted: Unrecognized name: dev127-virt2
      
      ...
      
      2016-03-02 06:33:43,630 INFO  [hudson.slaves.ChannelPinger] (Ping thread for channel hudson.remoting.Channel@5caca20e:dev127-virt2) Ping failed. Terminating the channel.
      java.util.concurrent.TimeoutException: Ping started on 1456918183629 hasn't completed at 1456918423630
        at hudson.remoting.PingThread.ping(PingThread.java:125)
        at hudson.remoting.PingThread.run(PingThread.java:86)
      
      ...
      
      2016-03-02 06:38:26,902 WARNING [hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor] (Monitoring thread for Free Temp Space started on Wed Mar 02 06:38:26 EST 2016) Failed to monitor dev127-virt2 for Free Temp Space
      hudson.remoting.ChannelClosedException: channel is already closed
        at hudson.remoting.Channel.send(Channel.java:549)
        at hudson.remoting.Request.callAsync(Request.java:204)
        at hudson.remoting.Channel.callAsync(Channel.java:778)
        at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)
      Caused by: java.io.IOException
        at hudson.remoting.Channel.close(Channel.java:1105)
        at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
        at hudson.remoting.PingThread.ping(PingThread.java:125)
        at hudson.remoting.PingThread.run(PingThread.java:86)
      Caused by: java.util.concurrent.TimeoutException: Ping started on 1456918183629 hasn't completed at 1456918423630
        ... 2 more
      
      ... (All monitors fail with "channel is already closed")
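
The size of the window in which the master still considers the old channel alive can be read off the two "Ping started on ... hasn't completed at ..." messages in the logs above. The following is only a small arithmetic sketch over those logged epoch-millisecond values; the class and method names are illustrative and are not part of Jenkins or the remoting library.

```java
import java.time.Duration;

public class PingWindow {
    // Epoch-millisecond values copied from the TimeoutException messages in the logs.
    static final long SLAVE_PING_START = 1456918109582L;
    static final long SLAVE_PING_GIVE_UP = 1456918349582L;
    static final long MASTER_PING_START = 1456918183629L;
    static final long MASTER_PING_GIVE_UP = 1456918423630L;

    /** Elapsed time between two epoch-millisecond timestamps. */
    static Duration between(long fromMillis, long toMillis) {
        return Duration.ofMillis(toMillis - fromMillis);
    }

    /** Time during which the slave has given up but the master has not. */
    static Duration staleWindow() {
        return between(SLAVE_PING_GIVE_UP, MASTER_PING_GIVE_UP);
    }

    public static void main(String[] args) {
        System.out.println("slave ping timeout:  " + between(SLAVE_PING_START, SLAVE_PING_GIVE_UP).toMillis() + " ms");
        System.out.println("master ping timeout: " + between(MASTER_PING_START, MASTER_PING_GIVE_UP).toMillis() + " ms");
        System.out.println("stale window:        " + staleWindow().toMillis() + " ms");
    }
}
```

Both sides waited about 240 seconds for their ping, but the master's ping started roughly 74 seconds after the slave's. The reconnect attempt at 06:32:42 fell inside that window, which matches the "already connected to this master" rejection in the logs.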
      

            Activity

            rcgroot René de Groot added a comment -

            If rejecting the connection is the correct behavior of the master, then the agent should keep retrying so that the system as a whole can recover.

            If the rejection is unwarranted because the old connection was dead and the master is too slow to realize this, then the issue lies with the code running on the master node.
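
As an illustration of the "keep retrying" option only, a bounded retry loop with exponential backoff could look roughly like the sketch below. It is not the remoting library's code: `Connector`, `Sleeper`, and `connectWithRetry` are hypothetical names introduced here, and the attempt limit and delays are arbitrary assumptions.

```java
public class ReconnectSketch {
    /** Hypothetical stand-in for one handshake attempt; not a Jenkins remoting API. */
    interface Connector {
        /** Returns true if the master accepted the connection, false if it rejected it. */
        boolean tryConnect() throws Exception;
    }

    /** Hypothetical hook so callers can choose how to wait between attempts. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Retries until the connection is accepted or maxAttempts is reached.
     * Returns the 1-based attempt number that succeeded, or -1 if all attempts were rejected.
     */
    static int connectWithRetry(Connector connector, int maxAttempts,
                                long initialDelayMs, long maxDelayMs,
                                Sleeper sleeper) throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connector.tryConnect()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                sleeper.sleep(delay);
                // Double the wait each time, capped at maxDelayMs.
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // Simulated master that rejects the first three attempts, as if it had
        // not yet noticed that the old channel is dead.
        int[] calls = {0};
        Connector simulated = () -> ++calls[0] > 3;
        int attempts = connectWithRetry(simulated, 10, 10, 80, Thread::sleep);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

With a retry budget longer than the master's ping timeout, the agent would outlast the period in which the master still believes the old connection is alive, rather than aborting on the first rejection.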

            oleg_nenashev Oleg Nenashev added a comment -

            Merging the issue into JENKINS-28492. It is likely caused by the agent connection hanging on the master side.


              People

              • Assignee: Unassigned
              • Reporter: olivergondza Oliver Gondža
              • Votes: 4
              • Watchers: 7