JENKINS-47657

Agent running as Windows service kills all running jobs on reconnect

      Description

      We are running several JNLP slaves on Windows as Windows services using the WinSW wrapper. On some machines, when an agent loses the connection to the master, all running processes are killed and the jobs never complete.

      This happens since the agent tries to restart itself when it loses connection. There are two possibilities:

      • If the agent runs as a user that is a local admin (sadly the default, since services run as the SYSTEM user by default), winsw restarts the service. Upon restarting the service, both winsw and Windows kill all processes that belong to the service, which includes all processes of currently running jobs.
      • If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.
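
      To tell which of the two cases applies on a given machine, the account the service runs under can be checked with sc. A minimal sketch, assuming the service is installed under the name jenkinsslave-C__Jenkins (the actual name depends on the installation):

      ```batch
      rem Print the configured logon account of the service (the service name is an assumption).
      rem "LocalSystem" means the service runs as the SYSTEM user.
      sc qc "jenkinsslave-C__Jenkins" | findstr /C:"SERVICE_START_NAME"
      ```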

      Frankly, I don't see any reason why an agent should restart itself on connection loss. In the case of an agent running as a Windows service, it can never work properly and is thus entirely useless.

      A solution would be to remove jenkins.slaves.restarter.WinswSlaveRestarter entirely.

          Activity

          oleg_nenashev Oleg Nenashev added a comment - edited

          IIRC Windows service agent restart happens if and only if the JNLP process experiences a severe issue. In such case a Channel will be broken, and all non-durable tasks will likely get aborted anyway.

          > If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.

          Are you talking about Pipeline jobs or other Durable Task implementations?

          > A solution would be to remove jenkins.slaves.restarter.WinswSlaveRestarter entirely.

          Adding a flag for Disabling the restarter is definitely reasonable. Regarding the complete removal, it needs more research. I have never been brave enough to run Jenkins agents with a local admin.

          > (sadly the default, since services run as the SYSTEM user by default)

          Yes, I would rather rework the current Installer GUI entirely. It should just generate a sample config and then point the user to installation guidelines. I hope nobody runs JNLP files as administrator, so the service installation in Web UI should fail by default on Win7+ systems.

          procom_bl Thomas Bächler added a comment -

          > IIRC Windows service agent restart happens if and only if the JNLP process experiences a severe issue. In such case a Channel will be broken, and all non-durable tasks will likely get aborted anyway.

          From what I can tell from the logs, a restart happens on every connection loss:

          Okt 02, 2017 9:26:05 AM hudson.remoting.jnlp.Main$CuiListener status
          INFORMATION: Connected
          Okt 02, 2017 10:04:20 AM hudson.remoting.jnlp.Main$CuiListener status
          INFORMATION: Terminated
          Okt 02, 2017 10:04:35 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Failed to connect to the master. Will retry again
          [...] (previous message repeats until the master is reachable again)
          Okt 02, 2017 10:06:01 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
          INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@3c30a4a0

          > > If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.

          > Are you talking about Pipeline jobs or other Durable Task implementations?

          We only use pipeline jobs.

          > Adding a flag for Disabling the restarter is definitely reasonable. Regarding the complete removal, it needs more research.

          Fair enough, though I still fail to see a case where the restart is useful.

          > > (sadly the default, since services run as the SYSTEM user by default)

          > Yes, I would rather rework the current Installer GUI entirely. It should just generate a sample config and then point the user to installation guidelines. I hope nobody runs JNLP files as administrator, so the service installation in Web UI should fail by default on Win7+ systems.

          True, the installer from the GUI always fails. However, running jenkins-slave.exe install as admin (after the GUI install failed) installs the service, but sets its executing user to the SYSTEM user (which is the Windows default). This is very bad practice IMO - but if that default is changed, the WinswSlaveRestarter would never work.

          oleg_nenashev Oleg Nenashev added a comment -

          > However, running jenkins-slave.exe install as admin (after the GUI install failed) installs the service, but sets its executing user to the SYSTEM user (which is the Windows default). This is very bad practice IMO - but if that default is changed, the WinswSlaveRestarter would never work.

          Actually it's configurable: https://github.com/kohsuke/winsw/blob/master/doc/xmlConfigFile.md#service-account . The problem is that the option is not provided by default in the Jenkins config. Defining passwords in plain text is also far from being a good recommendation, but WinSW also supports interactive mode.

          mus65 m t added a comment -

          Has anybody found a workaround for this? We have a Windows agent that has to run as an unprivileged user and it's quite annoying that it doesn't restart itself when it disconnects.

          I also don't see a reason for the service to ever restart. It completely breaks the durability of pipeline jobs. JENKINS-27617 may also fix this, but imho not restarting in the first place is a better option.

          This is with Jenkins 2.152 and the agent running on Windows 10 x64.

          mus65 m t added a comment -

          For anyone else looking for a workaround, it turns out using SSH with Windows works quite well. I installed OpenSSH as described at the link below and connected the node with "Launch agent via SSH".

          https://github.com/PowerShell/Win32-OpenSSH/wiki/Install-Win32-OpenSSH

          fsteff Flemming Steffensen added a comment -

          One of our Win-10 machines was very seriously affected by this, going offline several times per day.

          As a workaround, I added a batch script that checks the status of the service every 10 minutes and restarts the service if it is stopped.

          The batch script is scheduled with the Windows Task Scheduler and set to run as a high priority task whenever the computer starts. Note that the script must be run by at least a local administrator.

          I've placed the following code in a file called c:\Jenkins\EnsureJenkinsServiceRunnning.cmd :

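          @echo off
          rem Name of the agent service; adjust it to match your installation.
          set "ServiceName=jenkinsslave-C__Jenkins"
          rem sc query prints a line such as "STATE : 4  RUNNING". With ":" and space
          rem as delimiters, the third token is the state name.
          rem Start the service unless it is reported as RUNNING.
          for /F "tokens=3 delims=: " %%H in ('sc query "%ServiceName%" ^| findstr "        STATE"') DO (
            if /i "%%H" neq "RUNNING" (
              net start "%ServiceName%"
            )
          )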
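
          The scheduling described above could be registered from an elevated command prompt roughly like this. It is only a sketch: the task name EnsureJenkinsServiceRunning is hypothetical, running the task as SYSTEM is an assumption, and a single repeating 10-minute trigger stands in for the start-at-boot setup.

          ```batch
          rem Run the check script every 10 minutes as SYSTEM (task name and account are assumptions).
          rem /RL HIGHEST requests elevated privileges; it does not change the scheduling priority.
          schtasks /Create /TN "EnsureJenkinsServiceRunning" /TR "c:\Jenkins\EnsureJenkinsServiceRunnning.cmd" /SC MINUTE /MO 10 /RU SYSTEM /RL HIGHEST
          ```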
          

            People

            • Assignee: Kohsuke Kawaguchi (kohsuke)
            • Reporter: Thomas Bächler (procom_bl)