Jenkins / JENKINS-43128

JNLP slave run as a Windows service fails to reconnect after slave reboot


      We have an issue where Windows slaves go offline every time our infrastructure team patches them. The sequence is:

      1. The machines get patched with the latest Windows patches.
      2. This triggers a reboot.
      3. The slave service shuts down with a log entry in the jenkins-slave.wrapper log to the effect of:
        2017-03-27 07:50:19 - Shutdown exception
        Message:A system shutdown is in progress. (Exception from HRESULT: 0x8007045B)
        Stacktrace:   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32 errorCode, IntPtr errorInfo)
           at System.Management.ManagementScope.InitializeGuts(Object o)
           at System.Management.ManagementScope.Initialize()
           at System.Management.ManagementObjectSearcher.Initialize()
           at System.Management.ManagementObjectSearcher.Get()
           at winsw.WrapperService.GetChildPids(Int32 pid)
           at winsw.WrapperService.StopProcessAndChildren(Int32 pid)
           at winsw.WrapperService.StopIt()
           at winsw.WrapperService.OnShutdown()
         
      4. The slave restarts and we see this in the jenkins-slave_<date>.err log:
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main createEngine
        INFO: Setting up slave: sv20-jenddb-001
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener <init>
        INFO: Jenkins agent is running in headless mode.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Locating server among [https://jenkins.core.cvent.org/]
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Handshaking
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server reports protocol JNLP3-connect not supported, skipping
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Trying protocol: JNLP2-connect
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server didn't accept the handshake: sv20-jenddb-001 is already connected to this master. Rejecting this connection.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Trying protocol: JNLP-connect
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server didn't accept the handshake: sv20-jenddb-001 is already connected to this master. Rejecting this connection.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener error
        SEVERE: The server rejected the connection: None of the protocols were accepted
        java.lang.Exception: The server rejected the connection: None of the protocols were accepted
        	at hudson.remoting.Engine.onConnectionRejected(Engine.java:380)
        	at hudson.remoting.Engine.run(Engine.java:352)
         

      We then go in and restart the slave service manually and everything is fine.

      What seems to be happening is that when the slave service shuts down because of a system shutdown request, it fails to notify the master that it is shutting down. As a result, when it starts back up after the reboot, the master still thinks it is connected and refuses the new connection. By the time we get in there to manually restart the service, the master has realized the slave is offline, so the service restart and reconnection work fine at that point.
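
The suspected timeline can be sketched with a toy model. This is only an illustration of the hypothesis above, not Jenkins code: the `Master` class, its methods, and the 240-second `ping_timeout` are assumptions made up for the example. The 150-second gap roughly matches the log timestamps (shutdown at 07:50:19, restart at 07:52:52).

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy model of a master that tracks slaves by name (hypothetical, not Jenkins code)."""
    ping_timeout: float = 240.0  # assumed seconds of silence before a slave is marked offline
    last_seen: dict = field(default_factory=dict)  # slave name -> time of last traffic

    def expire(self, now: float) -> None:
        """Drop slaves that have been silent for longer than ping_timeout."""
        stale = [n for n, t in self.last_seen.items() if now - t > self.ping_timeout]
        for n in stale:
            del self.last_seen[n]

    def connect(self, name: str, now: float) -> bool:
        """Accept the connection unless the name is still registered as online."""
        self.expire(now)
        if name in self.last_seen:
            return False  # corresponds to "already connected to this master"
        self.last_seen[name] = now
        return True


master = Master()
first = master.connect("sv20-jenddb-001", now=0.0)  # initial connection
# The machine reboots without notifying the master, so the entry stays registered.
after_reboot = master.connect("sv20-jenddb-001", now=150.0)  # service restarts after the reboot
manual_restart = master.connect("sv20-jenddb-001", now=600.0)  # operator restarts the service later
print(first, after_reboot, manual_restart)
```

In this model the automatic reconnection is rejected because the stale entry has not expired yet, while the later manual restart succeeds, which is the behaviour we observe.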

      It seems there are two possible solutions here:

      1. The slave should notify the master that it is shutting down so that the master will not still think it is 'online'.
      2. The master, when it receives a connection request for a slave that it thinks is 'online', should verify that the old connection is really still active before refusing to accept the new one.

      Or do both?
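
To make the two proposals concrete, here is a minimal sketch of how they might look on the master side. Everything in it is hypothetical: the `Registry` class, the `probe` callback (standing in for whatever liveness check the master could run against the old connection), and the method names are invented for illustration and do not reflect the actual Jenkins or remoting API.

```python
from typing import Callable, Dict


class Registry:
    """Hypothetical master-side registry; names and methods are illustrative only."""

    def __init__(self) -> None:
        # slave name -> probe returning True if the existing connection still responds
        self._probes: Dict[str, Callable[[], bool]] = {}

    def connect(self, name: str, probe: Callable[[], bool]) -> bool:
        """Proposal 2: before rejecting a duplicate name, check the old connection."""
        old_probe = self._probes.get(name)
        if old_probe is not None and old_probe():
            return False  # the old connection is really alive, so keep rejecting
        self._probes[name] = probe  # first connection, or the old one was dead
        return True

    def disconnect(self, name: str) -> None:
        """Proposal 1: the slave announces its shutdown so the entry is removed."""
        self._probes.pop(name, None)


registry = Registry()
old_link = {"alive": True}

initial = registry.connect("sv20-jenddb-001", lambda: old_link["alive"])
duplicate = registry.connect("sv20-jenddb-001", lambda: True)  # old link still alive
old_link["alive"] = False  # the machine rebooted without notifying the master
after_reboot = registry.connect("sv20-jenddb-001", lambda: True)  # stale entry is replaced
registry.disconnect("sv20-jenddb-001")  # clean shutdown notification
after_clean_shutdown = registry.connect("sv20-jenddb-001", lambda: True)
print(initial, duplicate, after_reboot, after_clean_shutdown)
```

Doing both would cover the clean-shutdown case cheaply while still protecting against shutdowns where the notification never reaches the master.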

      Note that we can reproduce this simply by rebooting a Windows slave. It always fails to reconnect as described.

            oleg_nenashev Oleg Nenashev
            kbaltrinic Kenneth Baltrinic