Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-50219

Jenkins agent windows service fails to restart with an unhandled COMException in WinSw's log

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Minor
    • Resolution: Unresolved
    • Labels:
    • Environment:
      Jenkins 2.9.84
      windows-slave-installer 1.9.2
      jenkins-slave.exe running on Windows 10 + jre 1.8.0.161 under local admin credentials.
    • Similar Issues:

      Description

      We have noticed that our Windows 10 nodes have been intermittently failing to come back online after Jenkins / master restarts.

       

      The following entries are shown in WinSw's Jenkins-slave.wrapper.log file at the time of the failures. In particular it appears that the unhandled COMException is preventing the service restarting successfully.

       

      2018-03-16 12:26:38,011 DEBUG - Starting ServiceWrapper in the CLI mode
      2018-03-16 12:26:38,589 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
      2018-03-16 12:26:38,682 DEBUG - Completed. Exit code is 0
      2018-03-16 12:26:38,870 DEBUG - Starting ServiceWrapper in the CLI mode
      2018-03-16 12:26:39,151 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
      2018-03-16 12:26:39,276 INFO  - Stopping jenkinsslave-c__jenkins
      2018-03-16 12:26:39,292 DEBUG - ProcessKill 332
      2018-03-16 12:26:39,386 INFO  - Found child process: 420 Name: conhost.exe
      2018-03-16 12:26:39,433 INFO  - Stopping process 420
      2018-03-16 12:26:39,448 INFO  - Send SIGINT 420
      2018-03-16 12:26:39,464 WARN  - SIGINT to 420 failed - Killing as fallback
      2018-03-16 12:26:39,464 INFO  - Stopping process 332
      2018-03-16 12:26:39,479 INFO  - Send SIGINT 332
      2018-03-16 12:26:39,495 WARN  - SIGINT to 332 failed - Killing as fallback
      2018-03-16 12:26:39,511 INFO  - Finished jenkinsslave-c__jenkins
      2018-03-16 12:26:39,511 DEBUG - Completed. Exit code is 0
      2018-03-16 12:26:40,342 FATAL - Unhandled exception
      System.Runtime.InteropServices.COMException (0x80040150): Could not read key from registry (Exception from HRESULT: 0x80040150 (REGDB_E_READREGDB))
         at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32 errorCode, IntPtr errorInfo)
         at System.Management.ManagementObjectSearcher.Get()
         at WMI.WmiRoot.ClassHandler.Invoke(Object proxy, MethodInfo method, Object[] args)
         at winsw.WrapperService.Run(String[] _args, ServiceDescriptor descriptor)
         at winsw.WrapperService.Main(String[] args)
      

      After some inspection of WinSw's source code we have determined that the exception is being thrown from the line "s = svc.Select(d.Id);" in the following snippet (found in the "Run" method in "src/Core/ServiceWrapper/Main.cs lines 687-704). When a secondary WinSw process is run with the "restart!" parameter by Jenkins.

       

      if (args[0] == "restart")
      {
          Log.Info("Restarting the service with id '" + d.Id + "'");
          if (s == null) 
              ThrowNoSuchService();
          if(s.Started)
              s.StopService();
          while (s.Started)
          {
              Thread.Sleep(1000);
              s = svc.Select(d.Id);
          }
          s.StartService();
          return;
      }

      We are currently able to work around the problem by restarting the service manually on affected nodes.

       

       

        Attachments

          Activity

          Hide
          oleg_nenashev Oleg Nenashev added a comment -

          Tom Manning Please also submit a ticket to https://github.com/kohsuke/winsw . That project is outside Jenkins governance

          Show
          oleg_nenashev Oleg Nenashev added a comment - Tom Manning Please also submit a ticket to https://github.com/kohsuke/winsw . That project is outside Jenkins governance
          Hide
          tom_m_third_dimension Tom Manning added a comment -

          After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service.

          The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff".
           
          The following articles describe the cause:
          https://support.microsoft.com/en-gb/help/2287297/a-com-application-may-stop-working-on-windows-server-2008-when-a-user
          https://support.microsoft.com/en-gb/help/3114011/800703fa-illegal-operation-attempted-on-a-registry-key-that-has-been-m

          Show
          tom_m_third_dimension Tom Manning added a comment - After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service. The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff".   The following articles describe the cause: https://support.microsoft.com/en-gb/help/2287297/a-com-application-may-stop-working-on-windows-server-2008-when-a-user https://support.microsoft.com/en-gb/help/3114011/800703fa-illegal-operation-attempted-on-a-registry-key-that-has-been-m
          Hide
          mario_klebsch Mario Klebsch added a comment -

          I also see this error regularly after server reboots. But I am sure, that there were no interactive sessions on the "Log On As" account of the client failing to restart.

          I change the policy setting anyway, I will see, whether that helps or not.

           

          But I wonder, whether the restart of the jenkins client really is needed. In jenkins-slave.err.log I see the following entries:

           

          Dez 02, 2018 4:56:56 AM hudson.remoting.jnlp.Main$CuiListener status
          INFORMATION: Terminated
          Dez 02, 2018 4:57:11 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException connect timed out
          ...
          Dez 02, 2018 4:58:22 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused: connect
          ...
          Dez 02, 2018 5:03:46 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException Read timed out
          ...
          Dez 02, 2018 5:04:56 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Master isn't ready to talk to us on http://jenkins01:8080/tcpSlaveAgentListener/. Will try again: response code=503
          ...
          Dez 02, 2018 5:09:20 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
          INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@63a1d34b
          

           

          None of the errors listed should (IMHO) be a reason to restart the client.

          The log does not tell, why the clients decide to restart itself. It may be the duration of the downtime, but the time difference between 5:09:20 and 4:56:56 is 744 seconds. I have 65 "waitForReady" messages and 26 "Will try again: response code=503" messages.

          At least the "response code=503" error message prevent the jenkins client reboot, especially since it is a known jenkins problem, that jeninks reboots take ages.

           

          Mario

           

           

          Show
          mario_klebsch Mario Klebsch added a comment - I also see this error regularly after server reboots. But I am sure, that there were no interactive sessions on the "Log On As" account of the client failing to restart. I change the policy setting anyway, I will see, whether that helps or not.   But I wonder, whether the restart of the jenkins client really is needed. In jenkins-slave.err.log I see the following entries:   Dez 02, 2018 4:56:56 AM hudson.remoting.jnlp.Main$CuiListener status INFORMATION: Terminated Dez 02, 2018 4:57:11 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException connect timed out ... Dez 02, 2018 4:58:22 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady INFORMATION: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused: connect ... Dez 02, 2018 5:03:46 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException Read timed out ... Dez 02, 2018 5:04:56 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady INFORMATION: Master isn't ready to talk to us on http: //jenkins01:8080/tcpSlaveAgentListener/. Will try again: response code=503 ... Dez 02, 2018 5:09:20 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@63a1d34b   None of the errors listed should (IMHO) be a reason to restart the client. The log does not tell, why the clients decide to restart itself. It may be the duration of the downtime, but the time difference between 5:09:20 and 4:56:56 is 744 seconds. I have 65 "waitForReady" messages and 26 "Will try again: response code=503" messages. At least the "response code=503" error message prevent the jenkins client reboot, especially since it is a known jenkins problem, that jeninks reboots take ages.   Mario    
          Hide
          oleg_nenashev Oleg Nenashev added a comment -

          I might be able to work on it at some point, but I will unassign it for now so that anybody else can pick it up and work on it

          Show
          oleg_nenashev Oleg Nenashev added a comment - I might be able to work on it at some point, but I will unassign it for now so that anybody else can pick it up and work on it
          Hide
          whatsdevops Angelo Loria added a comment -

          Tom Manning added a comment - 2018-05-24 05:47

          After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service.

          The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff".

          FYI this resolved the issue for us on a Windows Server 2019 VM.

          Show
          whatsdevops Angelo Loria added a comment - Tom Manning  added a comment - 2018-05-24 05:47 After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service. The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff". FYI this resolved the issue for us on a Windows Server 2019 VM.

            People

            • Assignee:
              Unassigned
              Reporter:
              tom_m_third_dimension Tom Manning
            • Votes:
              10 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated: