JENKINS-38807

Jenkins 2.7.4 seems to leave behind Java processes (on Windows agent) if the build is aborted/agent loses connection

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: core
    • Labels:
    • Environment:
      Jenkins 2.7.4, Windows 7-10, Java 8

      Description

      We have a build step that runs a TestNG suite, with the command looking something like this:

      java -jar -Done-jar.main.class=org.testng.TestNG the-jar.jar TheTest.xml
      

      If the process is aborted in any way (manual intervention, Jenkins build timeout, etc.) OR if the agent loses its connection to the master long enough to fail the build, a Java process is left behind.

      This is particularly damaging to us, as we load a DLL in the Java process, locking the file handle. If we attempt the job again, we cannot load the DLL again, meaning that all future builds will fail without manual intervention (killing the leftover process manually).

      It is possible to reproduce with ANY java process executed on the Windows agent.

      This bug seems similar to JENKINS-26048, but I did not understand from the title/description if it was the same problem or similar symptoms. Feel free to close as duplicate if it is.
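
The manual cleanup mentioned above (killing the leftover process by hand) can be sketched in Java 8. This assumes the PID of the leftover process is already known, for example from Task Manager. The class and method names are hypothetical; `taskkill` with `/PID`, `/T` (include child processes) and `/F` (force) is the standard Windows command. It is only a sketch of a workaround, not how Jenkins itself cleans up processes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds and (on Windows only) runs a taskkill command.
class LeftoverProcessKiller {

    /** Builds the taskkill argument list for the given PID. */
    static List<String> buildTaskkillCommand(long pid, boolean includeChildren) {
        if (pid <= 0) {
            throw new IllegalArgumentException("pid must be positive: " + pid);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("taskkill");
        cmd.add("/PID");
        cmd.add(Long.toString(pid));
        if (includeChildren) {
            cmd.add("/T"); // also terminate child processes
        }
        cmd.add("/F"); // force termination
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java LeftoverProcessKiller <pid>");
            return;
        }
        List<String> cmd = buildTaskkillCommand(Long.parseLong(args[0]), true);
        System.out.println(String.join(" ", cmd));

        // Only execute on Windows; elsewhere just print the command.
        String os = System.getProperty("os.name", "").toLowerCase();
        if (os.contains("windows")) {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            System.out.println("taskkill exit code: " + p.waitFor());
        }
    }
}
```

Run as a cleanup step before the next build, this would release the locked DLL, provided the service account is allowed to terminate the process.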

            Activity

            krogan mark mann added a comment -

            I have just noticed that WinSW is now on 2.0.1.
            I will upgrade and see if the problem still exists. If it does, I will raise a separate bug.
            Thx!

            oleg_nenashev Oleg Nenashev added a comment -

            mark mann any updates?

            krogan mark mann added a comment - edited

            hi oleg...

            So for the moment I have stayed on 2.32.2 and still see the problem when the slave server is rebooted: the slave starts (via the service auto-start), then it stops the service but leaves behind a Java process which we can't kill.

            I'm a C#'er, so I can read but not debug the Java code, but it just feels like the master can't handle the new process handshake when the slave computer reboots and tries to re-establish the connection, so one end terminates it (maybe the master, because the slave still tries to live on?).

            To that end, I've even played about with the master's polling interval to see if I can get the master to terminate the connection while the server is rebooting, but it feels like I am playing with fire on a global setting such as that, given that the proportion of time servers spend disconnected during a reboot versus online is minuscule.

            -Dhudson.remoting.Launcher.pingIntervalSec=55

             

            There is minimal detail in the logs: nothing in the Windows log and, if I am lucky, only "slave connection terminated" messages in the slave log.

            I did try to update my service host to WinSW 2.0.1, but as soon as the service starts, it looks to Jenkins and then switches it out for 1.18 instead. (Not sure if there is a way of getting the Jenkins slave to stop doing this, but I guess master/slave is trying to maintain compatibility.)

            I am eagerly waiting for the 2.50+ LTS, which has a heap of your changes regarding the Windows slaves and service host.

            /mm 
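
The `-D` flag quoted in the comment above is a JVM system property, so it only has an effect when it is passed on the command line of the JVM that reads it. A minimal Java 8 sketch of how such a property is typically read is shown here. The class name, method name and fallback value are hypothetical, and this is not the remoting library's actual code; only the property name comes from the comment.

```java
import java.util.Properties;

// Hypothetical illustration of reading a -D system property with a fallback.
class PingIntervalProperty {

    // Property name taken from the comment above.
    static final String PROPERTY = "hudson.remoting.Launcher.pingIntervalSec";

    /**
     * Returns the interval in seconds from the given properties.
     * fallbackSec is a placeholder chosen by the caller, not the real Jenkins default.
     */
    static int readPingIntervalSec(Properties props, int fallbackSec) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null) {
            return fallbackSec;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Malformed value: fall back rather than fail.
            return fallbackSec;
        }
    }

    public static void main(String[] args) {
        // System.getProperties() includes values set with -D on the command line.
        int interval = readPingIntervalSec(System.getProperties(), 0);
        System.out.println(PROPERTY + " = " + interval);
    }
}
```

Starting this class with `java -Dhudson.remoting.Launcher.pingIntervalSec=55 PingIntervalProperty` makes the value visible to the program; without the flag the fallback is used.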

            oleg_nenashev Oleg Nenashev added a comment -

            > I did try and update my service host to WinSW 2.0.1 but as soon as the service starts, it looks to jenkins and then switches it out for 1.18 instead. (not sure if there is a way of getting jenkins slave to stop doing this, but i guess master/slave is trying to maintain compatibility)

            It is a "self-upgrade" feature. I have added a flag for disabling this auto-update in Windows Agent Installer 1.9 (https://github.com/jenkinsci/windows-slave-installer-module/blob/master/CHANGELOG.md#19); see JENKINS-43603. But it has not been integrated into Jenkins weekly yet.

            As a workaround, you can make the file read-only for the service account.

             

            > but it just feels like the master can't handle the new process handshake when the slave computer reboots and tries to reestablish the connection, so then one end terminates it (maybe the master because the slave still tries to live on?)  

            > to that effect, I've even played about with the master's polling interval to see if i can get the master to terminate while the server is rebooting, but it feels like i am playing with fire on a global setting such as that... where the proportionate time of servers in disconnected reboot vs online is minuscule.

            One of the potential causes of a hanging agent is a non-released Channel object in the master. We have applied several fixes for it, but I am not 100% sure all potential causes are covered. Just in case, make sure you are running agents with the JNLP4 protocol. It seems to be much more reliable in terms of connection handling.

            oleg_nenashev Oleg Nenashev added a comment -

            mark mann So 2.60.1 should be released this Thursday, just FYI. You can try the release candidate from here: http://mirrors.jenkins.io/war-stable-rc/2.60.1/


              People

              • Assignee:
                Unassigned
                Reporter:
                gsfraley Greg Fraley
              • Votes:
                2
                Watchers:
                6
