Details

    • Type: Bug
    • Status: In Review
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core
    • Labels:
      None
    • Environment:
      any

      Description

      Using freestyle projects to execute bash shell scripts works fine, but cancelling a Jenkins job seems to use SIGKILL. As a result, the script cannot perform cleanup operations and free resources.

      SIGKILL cannot be handled by shell

      SIGINT/SIGTERM are not used by jenkins

      Preferred: SIGINT -> wait 5 seconds -> SIGKILL

            Activity

            markusb Markus Breuer created issue -
            deepchip Martin d'Anjou added a comment - - edited

            I created this freestyle job, but the traps are never invoked when hitting [x] to "stop" the job.

            #!/bin/bash
            echo "Starting $0"
            echo "Listing traps"
            trap -p
            echo "Setting trap"
            trap 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
            trap 'echo SIGINT; kill $pid; exit 2;' SIGINT
            echo "Listing traps again"
            trap -p
            echo "Sleeping"
            sleep 10 & pid=$!
            echo "Waiting"
            wait $pid
            echo "Exit status: $?"
            echo "Ending"
            

            It looks like Jenkins is using kill -9, but it cannot be, since the rest of the script is executed:

            Listing traps
            Setting trap
            Listing traps again
            trap -- 'echo SIGINT; kill $pid; exit 2;' SIGINT
            trap -- 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
            Sleeping
            Waiting
            Build was aborted
            Aborted by d'Anjou, Martin
            Build step 'Groovy Postbuild' marked build as failure
            Recording test results
            Exit status: 143
            Ending
            

            Is it possible that Jenkins disables the traps?
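The "Exit status: 143" line in the output above can be decoded: shells conventionally report 128 plus the signal number when a command ends because of a signal, and 143 - 128 = 15, which is SIGTERM. A minimal Python 3 sketch of that decoding follows; the helper name `signal_from_exit_status` is made up for illustration and is not part of Jenkins or bash.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name encoded in a shell exit status, or None.

    Assumes the common shell convention: status = 128 + signal number.
    """
    if status <= 128:
        return None  # ordinary exit code, no signal involved
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # above 128 but not a known signal number


if __name__ == "__main__":
    print(signal_from_exit_status(143))
```

So the `wait` in the script was ended by a SIGTERM rather than a SIGKILL (which would show up as 137), even though the trap output itself never reached the build log.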

            deepchip Martin d'Anjou added a comment -

            Making this a major issue because there is no way a free style job can clean up after itself.

            deepchip Martin d'Anjou made changes -
            Field Original Value New Value
            Priority Minor [ 4 ] Major [ 3 ]
            torbent torbent added a comment -

            I am struggling with this as well! There is documentation which states that Jenkins uses SIGTERM to kill processes, but I too am having a hard time trapping it. One of the problems I have is that even if my script might trap the TERM, Jenkins appears to not wait for termination of the process(es) it has started. It's a bit difficult, then, to know whether the traps work or not when I cannot see the output.

            You should be aware that the bash build scripts are usually invoked with -e, which may "break" your error handling. Jenkins will list all of the processes you have started, including the sleep, and send a TERM to all of them. Your sleep then fails (before you can kill it), causing the rest of the script to fail. It looks like you may have worked around that to get the "Ending" text out, but it caught me and may confuse others trying to reproduce the problem.
            The "list all of the processes" part involves an environment variable called BUILD_ID. See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller

            By using a set +e (and maybe BUILD_ID=ignore – so many experiments lately) I have managed to make my script ignore TERM, which can consistently lead to an orphaned bash. Jenkins is certain the build is aborted, but the script keeps running. I can kill the script (behind Jenkins) with -9, however.
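To make the BUILD_ID mechanism mentioned above more concrete, here is a rough Python 3 sketch of how a tool could find every process that was started with a given environment variable. It is only an illustration under stated assumptions: it is Linux-only, it reads `/proc/<pid>/environ`, and the function name `pids_with_env` is hypothetical. It is not the code Jenkins uses.

```python
import os


def pids_with_env(name, value):
    """Return the PIDs whose initial environment contains name=value.

    Linux-only sketch: relies on /proc/<pid>/environ, which holds the
    environment the process was started with, separated by NUL bytes.
    """
    wanted = f"{name}={value}".encode()
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/environ", "rb") as fh:
                variables = fh.read().split(b"\0")
        except OSError:
            continue  # process already exited, or owned by another user
        if wanted in variables:
            matches.append(int(entry))
    return sorted(matches)
```

A process that overrides BUILD_ID before spawning children would make those children invisible to a scan like this, which matches the orphaned-bash behaviour described above.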

            deepchip Martin d'Anjou added a comment -

            When the shell script starts with the shebang:

            #!/bin/bash
            set -o
            echo $-
            

            I get:

            allexport      	off
            braceexpand    	on
            emacs          	off
            errexit        	off
            errtrace       	off
            functrace      	off
            hashall        	on
            histexpand     	off
            history        	off
            ignoreeof      	off
            interactive-comments	on
            keyword        	off
            monitor        	off
            noclobber      	off
            noexec         	off
            noglob         	off
            nolog          	off
            notify         	off
            nounset        	off
            onecmd         	off
            physical       	off
            pipefail       	off
            posix          	off
            privileged     	off
            verbose        	off
            vi             	off
            xtrace         	off
            hB
            

            When the shell script does not start with the shebang:

            set -o
            echo $-
            

            I get:

            + set -o
            allexport      	off
            braceexpand    	on
            emacs          	off
            errexit        	on
            errtrace       	off
            functrace      	off
            hashall        	on
            histexpand     	off
            history        	off
            ignoreeof      	off
            interactive-comments	on
            keyword        	off
            monitor        	off
            noclobber      	off
            noexec         	off
            noglob         	off
            nolog          	off
            notify         	off
            nounset        	off
            onecmd         	off
            physical       	off
            pipefail       	off
            posix          	on
            privileged     	off
            verbose        	off
            vi             	off
            xtrace         	on
            + echo ehxB
            ehxB
            

            Conclusion: Jenkins forces -ex when there is no shebang (#!/bin/bash) line, so you can control at least that part.
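The practical effect of the errexit (-e) flag can be reproduced outside Jenkins. The Python 3 sketch below runs the same two-line script through `/bin/sh` with and without `-e`. The script text and the helper name `run_sh` are made up for the demonstration; the only assumption is that `/bin/sh` is a POSIX shell.

```python
import subprocess

# First command fails; the second only runs if the shell keeps going.
SCRIPT = "false\necho after-failure\n"


def run_sh(flags):
    """Run SCRIPT with /bin/sh plus the given flags; return (exit code, stdout)."""
    result = subprocess.run(
        ["/bin/sh", *flags, "-c", SCRIPT],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    print("with -e:   ", run_sh(["-e"]))
    print("without -e:", run_sh([]))
```

With `-e` the shell stops at the failing command, so anything after a killed `sleep` or `wait` never runs. Adding a shebang line, or calling `set +e` explicitly, avoids that.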

            deepchip Martin d'Anjou added a comment -

            First point: Changing the value of the BUILD_ID variable to bypass the tree killer is a bad idea: it changes the meaning of BUILD_ID. It would have been better to use a different variable name to express the "don't kill me" idea (hint: if the user sets DONTKILLME=true, then don't kill it).

            Second point: Changing BUILD_ID has no effect on the example script shown in the first comment: it seems Jenkins disables the traps. I tried setting BUILD_ID in a job parameter and in the environment injection plugin to no avail.

            Here are 2 scenarios explaining why Jenkins must not intercept the signals and must let the freestyle jobs handle their own termination:
            1) the freestyle job needs a way to remove temporary files it might have created
            2) the freestyle job needs a way to kill remote processes it might have created

            I feel scenario 2 needs an explanation: Say the freestyle job spawned a process on a remote host, and disconnected from that remote host. There is no way for the process tree killer to find the connection between the freestyle job bash script, and the remote process, only the freestyle job script can kill the remote job. This is why signals must be propagated and not intercepted.

            deepchip Martin d'Anjou made changes -
            Link This issue is related to JENKINS-3105 [ JENKINS-3105 ]
            deepchip Martin d'Anjou added a comment -

            After experimenting some more, it seems Jenkins cuts the ties to the child process too soon after sending the TERM signal. Sometimes, when the job runs on the master, I do see the message from the SIGTERM trap, but most of the time I don't. This makes it hard to tell what really happens. It looks like Jenkins simply needs to wait for the job process to close its stdout/stderr before it stops listening to the job.

            On IRC (May 8, 2013), there was a discussion on changing SIGTERM to SIGTERM -> wait 10 sec -> SIGKILL, but I would prefer if this delay was configurable or even optional, as the clean up done by a properly behaving job could take more than 10 seconds (and it does take a few minutes in my case due to a very large amount of small files to clean up on NFS).

            Here are loosely related but different requests:
            JENKINS-11995
            JENKINS-11996

            owenmehegan Owen Mehegan added a comment -

            This may explain a problem I've been seeing. When a user cancels a build while a Ruby 'bundle install' operation is happening, the job exits but the bundle process goes into a zombie-ish state (not literally a zombie process but it never exits), no longer a child of the Jenkins process. I have to kill it manually, and sometimes it freaks out and consumes a lot of resources on the box as well. I'm not sure if we need a bigger/different hammer here, or what.

            deepchip Martin d'Anjou added a comment -

            Jenkins leaks processes when jobs are killed. I think this is related to this issue, so instead of creating a new bug report, I am adding this comment.

            To reproduce the process leak, create a new freestyle job from a fresh install, and enter this script:

            #!/usr/bin/python
            import signal
            import time
            print "Main 1"
            def handler(*ignored):
                print "Ignored 1"
                time.sleep(120)
                print "Ignored 2"
            
            print "Main 2"
            signal.signal(signal.SIGTERM, handler)
            print "Main 3"
            time.sleep(120)
            print "Main 4"
            

            Then execute the build, and after a few seconds once the build is running, hit the red [x] button to kill the job. After the job is killed and Jenkins is done, go to the terminal and look for the python process. You should find something like this:

            $ ps -efH
            ...
            mdanjou   2154  2150  0 08:22 pts/0    00:00:00     bash
            mdanjou   2531  2154 16 08:24 pts/0    00:00:36       java -jar jenkins.war
            mdanjou   2601  2531  0 08:25 pts/0    00:00:00         /usr/bin/python /tmp/hudson3048464595979281901.sh
            

            The python script is still in memory, and still executing. However, Jenkins has cut the ties to the python script.

            Jenkins must not cut the ties until the script is done.

            The script in this comment is a simple example. In real project scripts, the signal handler is used to clean up temporary files and to terminate gracefully (e.g. by killing other spawned processes).
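For comparison with the reproduction script above, a cleanup-on-SIGTERM handler of the kind used in real project scripts might look roughly like this in Python 3. This is a sketch, not code from the thread: the names `Terminated` and `run_with_cleanup` are hypothetical, and it assumes the job keeps its temporary files in one scratch directory.

```python
import shutil
import signal
import tempfile


class Terminated(Exception):
    """Raised in the main thread when SIGTERM arrives."""


def _raise_terminated(signum, frame):
    raise Terminated(signum)


def run_with_cleanup(work):
    """Run work(scratch_dir) and always remove scratch_dir afterwards.

    Returns 0 on normal completion, or 128 + SIGTERM (143) if the work
    was interrupted by SIGTERM.
    """
    scratch = tempfile.mkdtemp(prefix="job-")
    previous = signal.signal(signal.SIGTERM, _raise_terminated)
    try:
        work(scratch)
        return 0
    except Terminated:
        return 128 + signal.SIGTERM
    finally:
        # Runs for both normal exit and SIGTERM, so temp files never leak.
        signal.signal(signal.SIGTERM, previous)
        shutil.rmtree(scratch, ignore_errors=True)
```

Turning the signal into an exception keeps the cleanup in one `finally` block. It only helps, of course, if the parent waits long enough for that block to finish, which is exactly what this issue asks Jenkins to do.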

            owenmehegan Owen Mehegan made changes -
            Assignee Kohsuke Kawaguchi [ kohsuke ]
            matthewlmcclure matthewlmcclure added a comment -

            I wrote a script that you can execute periodically from cron to clean up processes orphaned by Jenkins.

            deepchip Martin d'Anjou added a comment -

            There is more to it than cleaning up the orphaned processes, which by the way should be done by Jenkins and not by an external process. The way this should work is that Jenkins sends the signal (SIGINT or SIGTERM) and waits for the sub-processes to do their own cleanup. This gives the sub-processes a chance to propagate the signal to sub-sub-processes of their own (which, when you use a grid engine, might be running on other remote machines that are not Jenkins slaves).

            I modified the first shell script to write to a file during the traps: Jenkins cuts the ties too early and no files show up anywhere.

            #!/bin/bash
            echo "Starting $0"
            echo "Listing traps"
            trap -p
            echo "Setting trap"
            trap 'echo SIGTERM | tee trap.sigterm; kill $pid; exit 15;' SIGTERM
            trap 'echo SIGINT  | tee trap.sigint; kill $pid; exit 2;' SIGINT
            echo "Listing traps again"
            trap -p
            echo "Sleeping"
            sleep 20 & pid=$!
            echo "Waiting"
            wait $pid
            echo "Exit status: $?"
            echo "Ending"
            

            So the SIGINT -> wait N seconds for the build process to return -> SIGKILL (with a user configurable N) would be an acceptable solution. The value of N should be configurable for each job.
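The proposed "signal -> wait N seconds -> SIGKILL" sequence can be sketched in a few lines of Python 3. This illustrates the requested behaviour only and is not how Jenkins is implemented. The function name `stop_process` and its parameters are hypothetical, with `grace_seconds` playing the role of the per-job N.

```python
import signal
import subprocess


def stop_process(proc, grace_seconds, first_signal=signal.SIGINT):
    """Ask proc to stop, wait up to grace_seconds, then force-kill it.

    proc is a subprocess.Popen object. Returns the final return code
    (negative signal number if the process died from a signal).
    """
    if proc.poll() is not None:
        return proc.returncode  # already finished, nothing to do
    proc.send_signal(first_signal)
    try:
        # Give the process time to run its traps and handlers.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be trapped
        return proc.wait()
```

The important detail is that the caller keeps waiting on the process in both branches, so it never loses track of a child that is still cleaning up.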

            appid AppId Man added a comment -

            I'm also affected by this issue and would highly appreciate the solution proposed by Martin d'Anjou, in which Jenkins waits (a configurable amount of time) for its children to finish.

            Will this be implemented in the near future?

            tintinwebweb tintin tintin added a comment -

            I see the exact same issue as described in comment-182402.

            I am using the "execute python-script" build step to invoke fairly long-running Python processes (parents) that also spawn multiple sub-processes on demand, which are managed by the parent. I've implemented proper signal handling in order to clean up child processes and threads whenever the parent gets terminated. Unfortunately it looks like, as described in comment-182402, Jenkins notifies the parent but does not wait for the parent to clean up and terminate. Instead it detaches from the process, leaving it in a zombie-like state. In my case I keep finding processes sitting in futex calls waiting for a lock on a resource that never gets unlocked.

            Clean-up bash scripts are not an option as they do not prevent the process from locking, so some of the external resources that are also locked by my script will never get freed. The option I see is to make Jenkins wait for the hudson<...>.py process to terminate gracefully, and optionally force termination if the process's cleanup takes too long.

            I'd appreciate any clues on fixing this issue.
            Thanks

            sandor_balazsi Sandor Balazsi added a comment -

            Is there any progress on this issue?

            We are using Jenkins to start a Java-based test framework. This tool defines a couple of Java shutdown hooks that must be executed when the Java process terminates.

            Due to this problem, Jenkins does not wait for the proper termination of our Java process.

            yevkov Yevgen Kovalienia added a comment -

            Hi all,

            I have the same problem, and would appreciate the solution described by Martin.
            Is anybody working on an implementation?

            danielbeck Daniel Beck added a comment - - edited

            (To clarify, this comment is about the issue as reported, not any other process killing issues discussed in comments.)

            Jenkins preferably uses the java.lang.UNIXProcess.destroy(...) method in the JRE running Jenkins.

            In OpenJDK 7 and up it seems to send SIGTERM, which is consistent with my observations below.

            http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/27e0909d3fa0/src/solaris/native/java/lang/UNIXProcess_md.c#l722 (parameter is "false")
            http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/solaris/native/java/lang/UNIXProcess_md.c#l720 (parameter is "false")
            http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/solaris/native/java/lang/UNIXProcess_md.c#l947

            The call from Jenkins:
            https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L580

            A build output of a very simple shell script demonstrating that SIGTERM is handled:

            Building on master in workspace /var/lib/jenkins/workspace/jobname
            [jobname] $ /bin/sh -xe /tmp/hudson6478022098890718097.sh
            + trap 'echo TERM' TERM
            + sleep 50
            Terminated
            ++ echo TERM
            TERM
            Build was aborted
            Aborted by Daniel Beck
            Finished: ABORTED

            So check your JRE's source code or documentation to see whether/how UNIXProcess is implemented. OpenJDK (in my case OpenJDK 1.7.0.45) seems to behave.

            That said, the logging of hudson.util.ProcessTree might be interesting. Log on FINER or higher.

            pyrolistical Ronald Chen added a comment -

            Looks like it's SIGTERM in jdk6 as well: http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/b2317f5542ce/src/solaris/native/java/lang/UNIXProcess_md.c#l684

            pyrolistical Ronald Chen added a comment -

            Something is odd. The trap is working when Jenkins is on Ubuntu 12.10 but not on CentOS 6.3

            deepchip Martin d'Anjou added a comment - - edited

            This has gone from bad to worst. I have non-concurrent builds running back to back. When the first one is killed, it somehow keeps running in the background while the other one starts in the same workspace and fails when it should have passed.

            Daniel Beck: how do I set the Log to FINER or higher on the process tree, and where do I look up the log? Give me urls please, I sometimes don't understand all the jargon.

            This is the Java I am using:

            /usr/java/jdk1.7/bin/java -version
            java version "1.7.0_51"
            Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
            Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
            

            I run jenkins as /usr/java/jdk1.7/bin/java -jar jenkins.war

            danielbeck Daniel Beck added a comment -

            Martin d'Anjou:

            This has gone from bad to worst

            Unhelpful statement without mentioning the involved Jenkins versions. Which were bad, which are worse?

            how do I set the Log to FINER or higher on the process tree, and where do I look up the log?

            Go to http://jenkins/log, create a new log recorder (use any name), add a logger named hudson.util.ProcessTree and set its level to FINER. Save. Then check the log recorder's page when the issue occurs to see what it logs.

            deepchip Martin d'Anjou added a comment -

            Sorry I should have been more useful in my comment. By worst I meant that I have found that a killed job can corrupt the current job's workspace. I have found a way to reproduce this corruption 100% of the time.

            I use Jenkins 1.578 and Java SE JRE (build 1.7.0_45-b18), Java HotSpot 64-Bit Server VM (build 24.35-b08).

            I launch Jenkins on Linux RHEL 6.4 (Santiago) with java -jar jenkins.war

            The job needs to be configured with the following script (it is a variation on the python script above):

            #!/usr/bin/python
            import signal
            import time
            import os
            def handler(*ignored):
                time.sleep(120)
                fh = open("a_file.txt","a")
                fh.write("Handler of Build number: "+os.environ['BUILD_NUMBER'])
                fh.close()
            
            signal.signal(signal.SIGTERM, handler)
            fh = open("a_file.txt","w")
            fh.write("Main of Build number: "+os.environ['BUILD_NUMBER'])
            fh.close()
            time.sleep(120)
            

            Then configure the job to archive the artifact named a_file.txt.
            Run two builds back to back and kill the first one shortly after it starts. Let the second one run until it ends normally.

            The log as configured in the above comment, shows:

            killAll: process=java.lang.UNIXProcess@3d7c07c9 and envs={HUDSON_COOKIE=06668ba4-b481-4a17-86b3-5f4fbd4061b2}
            Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
            Recursively killing pid=1840
            Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
            Killing pid=1840
            Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
            Recursively killing pid=1840
            Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
            Killing pid=1840
            

            The unix process table, after the kill, shows that both jobs are still running:

            mdanjou   1251   953  0 10:07 pts/30   00:01:46           java -jar jenkins.war
            mdanjou   1840  1251  0 15:30 pts/30   00:00:00             /usr/bin/python /tmp/hudson6469713064377741807.sh
            mdanjou   1851  1251  0 15:30 pts/30   00:00:00             /usr/bin/python /tmp/hudson1969984296722384280.sh
            

            Both jobs are still running.

            When the second job completes, examine its artifact. It contains this:

            Main of Build number: 18Handler of Build number: 17
            

            So the killed build (#17) corrupts the workspace of the running build (#18).

            danielbeck Daniel Beck added a comment -

            Makes sense. I don't see how this could be circumvented. Maybe by waiting a bit to see whether SIGTERM worked, and if not, send SIGKILL? But Jenkins uses the JRE's abstraction of "kill a Unix process" and that behavior appears to be implementation dependent.

            Should be possible to write a plugin that sends SIGKILL if configured (e.g. for specific jobs only). Would that help?

            deepchip Martin d'Anjou added a comment -

            Maximum flexibility, as a plugin or built-in, in my view and without regard to feasibility, would be:

            • Wait a configurable amount of time for the process that received SIGTERM to come to its natural completion (i.e. let it run its traps/handlers).
            • If it is not dead by the timeout, send SIGKILL and wait for the process to be gone (N seconds, configurable).
            • If it is still not dead, move on to the next job or hang (as determined by the user; sometimes hanging is the right thing: spectacular failures are usually easy to debug, but it's a judgement call).
            • When moving on, perform the post-build steps.

            Regarding the last point, I am not sure whether Jenkins is supposed to perform the post-build steps when a build is killed by the user - but it is certainly something that would help me. Perhaps this is something that could be configured?

            I do not know what would belong to a plugin vs. what should be built-in.
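
The escalation described in the list above (SIGTERM, wait, SIGKILL, wait) can be sketched in Python. This is only an illustration of the proposed behavior, not how Jenkins implements process killing; the function name `stop_process` and both timeout parameters are hypothetical.

```python
import subprocess


def stop_process(proc, term_timeout=30.0, kill_timeout=5.0):
    """Ask proc to exit with SIGTERM, escalating to SIGKILL on timeout.

    Returns the exit code, or None if the process is still around even
    after SIGKILL (the caller then decides whether to move on or hang).
    """
    if proc.poll() is not None:
        return proc.returncode  # already finished, nothing to do
    proc.terminate()  # SIGTERM on POSIX: traps and handlers get to run
    try:
        return proc.wait(timeout=term_timeout)
    except subprocess.TimeoutExpired:
        pass  # handlers did not finish in time, escalate
    proc.kill()  # SIGKILL on POSIX: cannot be trapped
    try:
        return proc.wait(timeout=kill_timeout)
    except subprocess.TimeoutExpired:
        return None
```

On POSIX, a negative return value is the number of the signal that ended the process, so the caller can tell a graceful exit from a forced one.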

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Øyvind Harboe
            Path:
            src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
            http://jenkins-ci.org/commit/gerrit-trigger-plugin/0eff041d3388cc8a2dba3367f3f0b131d19c018c
            Log:
            adds workaround for JENKINS-17116

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Robert Sandell
            Path:
            src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
            http://jenkins-ci.org/commit/gerrit-trigger-plugin/a9de6534418bbeddf0ae449bae33b0a28b510ed5
            Log:
            Merge pull request #224 from zylin/jenkins-17116-workaround

            adds workaround for JENKINS-17116

            Compare: https://github.com/jenkinsci/gerrit-trigger-plugin/compare/afa1cff24324...a9de6534418b

            akostadinov akostadinov added a comment -

            I am wondering: because a universal solution might not be that easy, would it be possible to have a gracefulShutdown hook where one can plug in a custom implementation before the regular kill -9 kicks in?

            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 147947 ] JNJira + In-Review [ 177041 ]
            mbells Matthew Bells added a comment -

            I'm also having problems with this.
            In particular, our nodes are running Ubuntu 14.04. We are using Jenkins to run some tests as part of the build. There are a few steps where interruption will cause communication failures, leaked temporary files gigabytes in size, and locks that are not released. Orphaned processes are very bad as well, since they could lead to a new build starting communication on the same channel before the previous one has terminated.

            As Martin d'Anjou indicated, we would need a timeout parameter, since the time needed to handle SIGTERM may be about 300 seconds, which is probably a lot longer than someone implementing this fix would anticipate.

            kashierez Erez Kashi added a comment -

            I have the same issue. When canceling a job, I am trying to catch the signal inside a Python script and clean up.
            Is there any workaround for this issue?

            akostadinov akostadinov added a comment - - edited

            Erez Kashi, it is possible to:
            1. remove the jenkins cookie environment variable
            2. run your program in background (output still can go to stdout)
            3. launch another background process to check original process PID, such that when it is gone, it would kill the other child gracefully
            4. in main process, wait for the other two to complete (make sure second monitor process would exit if the first background process exits)
            5. take care to report proper termination status of the program

            Probably not very nice, but you can script it so that it runs arbitrary shell commands this way. It also might not be worth the effort; it wasn't for me.

            Forgot to mention: this would only work on UNIX derivatives IIRC.
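
A partial Python sketch of steps 1, 3 and 5 above, under stated assumptions: the variable name `HUDSON_COOKIE` is taken from the ProcessTree log earlier in this thread, and the names `pid_alive` and `run_and_watch` are hypothetical. It does not address how the helper itself escapes the Jenkins process-tree kill (for example by being started detached); that part would need to be verified separately.

```python
import os
import subprocess
import time

# Seen in the ProcessTree log in this thread; other Jenkins versions may
# set additional variables (assumption: extend this tuple as needed).
COOKIE_VARS = ("HUDSON_COOKIE",)


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def run_and_watch(cmd, watch_pid, poll=1.0, grace=30.0):
    """Run cmd without the cookie; stop it gracefully once watch_pid is gone.

    Returns the exit status of cmd so that the caller can report it.
    """
    env = {k: v for k, v in os.environ.items() if k not in COOKIE_VARS}
    child = subprocess.Popen(cmd, env=env)
    while child.poll() is None:
        if not pid_alive(watch_pid):
            child.terminate()  # graceful: SIGTERM first
            try:
                child.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                child.kill()  # last resort after the grace period
                child.wait()
            break
        time.sleep(poll)
    return child.returncode
```

Here `watch_pid` would be the PID of the shell that Jenkins started for the build step.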

            kashierez Erez Kashi added a comment -

            Thanks for the quick response. I will try it.

            deepchip Martin d'Anjou added a comment -

            I have done more experiments, and I am still not seeing the signal being sent, unlike what Daniel Beck is seeing.

            I started with this Java version:

            • Java 1.8.0_77. OS is Fedora release 14 (Laughlin).

            The Jenkins console shows:

            [freestyle-kill] $ /bin/sh -xe /tmp/hudson3073245061937599649.sh
            + trap 'echo TERM >terminated.txt' TERM
            + sleep 120
            Build was aborted
            Aborted by martinda
            Finished: ABORTED
            

            Observe that the script is not printing "TERM" to the file, unlike in Daniel's environment.

            I also tried these Java versions:

            • Java OpenJDK 1.8.0_111, Red Hat Enterprise Linux Server release 6.6 (Santiago)
            • Java HotSpot 1.8.0_121, Ubuntu 16.04 LTS (Xenial Xerus)
            • Java OpenJDK 1.8.0_121, Ubuntu 16.04 LTS (Xenial Xerus)

            I captured some logs using OpenJDK 1.8.0_121 on Ubuntu 16.04. In the terminal running Jenkins:

            INFO: jenkins-17116/freestyle #4 aborted
            java.lang.InterruptedException
                    at java.lang.Object.wait(Native Method)
                    at java.lang.Object.wait(Object.java:502)
                    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
                    at hudson.Proc$LocalProc.join(Proc.java:318)
                    at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:135)
                    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:95)
                    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:64)
                    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
                    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
                    at hudson.model.Build$BuildExecution.build(Build.java:205)
                    at hudson.model.Build$BuildExecution.doRun(Build.java:162)
                    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
                    at hudson.model.Run.execute(Run.java:1720)
                    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
                    at hudson.model.ResourceController.execute(ResourceController.java:98)
                    at hudson.model.Executor.run(Executor.java:404)
            

            In the jenkins log recorder:

            Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
            killAll: process=java.lang.UNIXProcess@2a8af379 and envs={HUDSON_COOKIE=2d16a893-7e22-4360-aad0-0931104599a5}
            Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
            Recursively killing pid=25054
            Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
            Recursively killing pid=25055
            Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
            Killing pid=25055
            Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
            Killing pid=25054
            

            None of them traps the signal.

            danielbeck Daniel Beck made changes -
            Assignee Kohsuke Kawaguchi [ kohsuke ]
            robinjarry Robin Jarry made changes -
            Assignee Robin Jarry [ robinjarry ]
            robinjarry Robin Jarry added a comment -

            I have 2 patches that add support for killing launched processes with specific signals.

            I'll submit them in a PR asap.

            Unfortunately, this is a *NIX-only solution, since it relies on signal support. For Windows I don't know what to do.

            mpistell Matthew Pistella added a comment -

            Did those patches get merged?

            hashar Antoine Musso added a comment - - edited

            Jenkins does send a SIGTERM. When running scripts, the process that receives it is usually /bin/sh, e.g.:

            /bin/sh -xe /tmp/hudson013456789.sh
            

            Most probably /bin/sh is bash. When bash receives a SIGTERM while executing a child process, it does not relay it to the child process.

            The sh script does get terminated, but the child process keeps running in the background, reparented (typically to init), and I guess Jenkins can't find it anymore.

            A fix when you have a single command is to prefix it with exec. E.g. instead of:

            somebuildtool
            

            do:

            exec somebuildtool
            

            /bin/sh will be replaced by somebuildtool, which then receives the signal directly. The drawback is that you can't run any more commands afterwards. To do that, you need to background each command, wait for it to terminate or for a signal to arrive, and then resend the signal to it. Something like:

            somebuildtool &
            apid=$!
            # Forward SIGTERM to the child, then wait for it to finish.
            # TERM (without the SIG prefix) is the portable POSIX spelling.
            trap 'kill -TERM "$apid"; wait "$apid"' TERM
            wait "$apid"
            
            juliccr JULIAN CRUZ CANADA RACINET added a comment -

            Is this issue going to be solved?

            I am unable to trap the signal. The pipeline progress log states "Sending interrupt signal to process", but although I am trapping SIGINT and also SIGTERM within my shell script, it's not working. It seems like it's sending a SIGKILL; could that be?

            hashar Antoine Musso added a comment -

            Julian, what is your script doing exactly? AFAIK when a build is aborted, Jenkins immediately closes the stdout/stderr connections, so your trap's message would not be shown on the console. If you get your trap to redirect to a file, you would then be able to tell whether it reacted properly by looking at the file on the slave.
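
For a Python build step, the same check could look like the sketch below: the handler records the signal in a file instead of printing it, since the console may already be closed. The function name `install_term_marker` and the file name used in the usage note are hypothetical.

```python
import os
import signal


def install_term_marker(path):
    """Install a SIGTERM handler that appends a line to `path`."""

    def handler(signum, frame):
        # Write to a file, not stdout: the console may already be closed.
        with open(path, "a") as fh:
            fh.write("received signal %d in pid %d\n" % (signum, os.getpid()))

    signal.signal(signal.SIGTERM, handler)
```

Calling `install_term_marker("terminated.txt")` at the top of the script and inspecting that file in the workspace after an abort shows whether the handler ran.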

            juliccr JULIAN CRUZ CANADA RACINET added a comment -

            Ok, thanks Antoine Musso, I'll try redirecting stdout to a file. We have a shell script that runs some dockerized integration tests, and when we cancel the build, we want to gracefully tear down the containers.

            Thanks for the info

            hashar Antoine Musso added a comment -

            Docker is the reason I came here.

            Note that there is a bug in Docker (as of 17.06, present since October 2013): when you use docker run --tty, signals are not proxied, so the signal is never forwarded to the Docker daemon and the container is left running. You can read about my findings at https://github.com/moby/moby/issues/9098#issuecomment-347536699

            The way I went in Jenkins is to use:

            exec docker run someimage
            # nothing will be run after that due to exec which replaces the shell
            

            This way the shell script started by the Jenkins agent is replaced by the docker command (due to exec). When the agent kills the process, docker receives the SIGTERM and forwards it to the daemon (note there is no --tty, which would disable that forwarding).

            And in the container entry point you might need trap handlers for SIGTERM / SIGINT. A rough example is https://gerrit.wikimedia.org/r/#/c/389937/5/dockerfiles/tox/run.sh

            Some random mess at https://phabricator.wikimedia.org/T176747 , but I would not recommend reading it :]

            hashar Antoine Musso added a comment -

            An alternative is to keep the container ID around, and when the build ends or gets aborted, find a way to 'docker stop' the container.

            Something like:

            docker run --cidfile container.pid
            

            And in a publisher (not sure it runs when a job is aborted though):

            docker stop --time=5 "$(cat container.pid)" || /bin/true
            

            Which would instruct Docker to stop the container.

            I think one of the Jenkins Docker plugins does exactly that. The steps are executed by the Jenkins agent itself, so that is probably a bit more robust than defining those steps in a job.
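
If the cleanup step is written in Python rather than shell, the cidfile approach above might look like the following sketch. The function name and the injectable `run` parameter are hypothetical (the latter is only there so the command can be inspected without Docker installed); `-t` is the short form of the stop timeout option.

```python
import subprocess


def stop_container_from_cidfile(cidfile, timeout=5, run=subprocess.call):
    """Stop the container whose ID was written by `docker run --cidfile`.

    Returns the exit status of `docker stop`, or 0 if there is nothing to stop.
    """
    try:
        with open(cidfile) as fh:
            container_id = fh.read().strip()
    except FileNotFoundError:
        return 0  # the container was never started
    if not container_id:
        return 0
    return run(["docker", "stop", "-t", str(timeout), container_id])
```

Unlike the `|| /bin/true` in the shell version, this returns the status of `docker stop`, so the caller decides whether a failed stop should fail the step.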

            juliccr JULIAN CRUZ CANADA RACINET added a comment -

            Thanks Antoine! we'll definitely try that.

            sreiter Stephan Reiter added a comment -

            Check out pull request https://github.com/jenkinsci/jenkins/pull/3414

            I added code that will make Jenkins wait for process termination (for up to 30 seconds; this should be made configurable).

            Behavior changes are as follows:

            • On Windows, Jenkins sends Ctrl+C for up to 30 seconds. If the process hasn't exited by then, it will be terminated like before.
            • On Linux, we send SIGTERMs for up to 30 seconds. If the process is still around after that, we continue as before: we close stdin/stdout/stderr, which causes the process to terminate. (Note that we could send SIGKILL.)

            Note that Jenkins doesn't use SIGKILL! It uses SIGTERM, but doesn't give the process any time to handle it before closing stdin/stdout/stderr.

             

            rahulnans Rahul Mahajan added a comment -

            Hi,

            Any update regarding this issue? I am facing the same problem: I am able to terminate gracefully from the command line, but when I start the job from Jenkins and abort it, it won't end gracefully.

            It should be noted that if I abort a running job from the command line, the graceful termination is visible in Jenkins as well. But not when we abort from Jenkins, which is quite weird.

            msinclair Mark Sinclair added a comment -

            It may be a good idea to create a plugin that remaps the abort button to run a script specified in each job.  It could optionally be configured to run the script and then do the normal abort process.  For example:

            1. User abort triggered.

            2. Job specific abort script runs without interrupting what the job is currently running.  A timeout counter starts simultaneously.

            3. At completion of the script or timeout, Jenkins checks the job status.  If the job is ready to exit normally, it does so - this will allow final status other than abort.  If the job is still running, the normal Jenkins abort procedure takes over.
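
            The three steps above can be sketched in shell. This is a hypothetical illustration of the proposal, not an existing Jenkins feature or plugin: the function name abort_with_hook, its arguments, and the 30-second default are made up, the timeout command is assumed to be available (GNU coreutils or BusyBox), and a plain SIGTERM stands in for "the normal Jenkins abort procedure".

```shell
#!/bin/sh
# Hypothetical sketch of the proposed abort flow.
# $1 = PID of the running job
# $2 = path to the job-specific abort script (receives the job PID as $1)
# $3 = time limit in seconds for the abort script (assumed default: 30)
# Returns 0 if the job ended on its own, 1 if the fallback abort was used.
abort_with_hook() {
    job_pid=$1
    hook=$2
    limit=${3:-30}

    # Step 2: run the abort script under a time limit. The job itself is
    # not signalled here; the script decides how to wind it down.
    timeout "$limit" "$hook" "$job_pid" || true

    # Step 3: if the job is gone, let it count as a normal exit.
    if ! kill -0 "$job_pid" 2>/dev/null; then
        return 0
    fi

    # Otherwise fall back to the normal abort (here: a plain SIGTERM).
    kill -TERM "$job_pid" 2>/dev/null
    return 1
}
```

            A real implementation would also need to decide what final build status to report when the abort script itself fails, which this sketch ignores.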

            sreiter Stephan Reiter added a comment - - edited

            Well, I am still pursuing the graceful-termination via SIGTERM. A stepping stone for this is ready to be merged into a library that is used by Jenkins for process management on Windows. After that has happened, a new version of that library needs to be bundled with Jenkins and then used in my pull request.

            rahulnans Rahul Mahajan added a comment -

            Hey Stephan Reiter, can we use the library (if it's available) in our existing Jenkins setup? And if yes, how can we do that? Thanks for your help!

            sreiter Stephan Reiter added a comment -

            Hi Rahul,

            The library alone is not enough - Jenkins needs to be recompiled to use it.

            WinP is the library I was talking about and needs to get the following change applied: https://github.com/kohsuke/winp/pull/49
            After that we can recompile Jenkins with the new WinP library and this change: https://github.com/jenkinsci/jenkins/pull/3414

            I am really hoping that we can move faster with those changes. I am usually somewhat patient, but the perceived lack of interest from Jenkins maintainers is hard to understand.

            rahulnans Rahul Mahajan added a comment -

            Thanks, Stephan Reiter. We will wait for it; until then we can work around the problem.

            sreiter Stephan Reiter added a comment -

            Cool, Rahul.  We've been working around it for 4 years. But now we got tired of the performance penalty the workarounds introduce for us, so I said: How hard can it be (to fix the problem)? Turns out, not hard. Just getting it done is. sigh

            oleg_nenashev Oleg Nenashev made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            oleg_nenashev Oleg Nenashev made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            oleg_nenashev Oleg Nenashev added a comment -

            I believe it is going to land in the next weekly

            oleg_nenashev Oleg Nenashev made changes -
            Remote Link This issue links to "https://github.com/jenkinsci/jenkins/pull/3414 (Web Link)" [ 21430 ]
            hashar Antoine Musso added a comment -

            The merge https://github.com/jenkinsci/jenkins/commit/d8eac92ee9a1c19bf145763589f1c152607bf3ed is in tag jenkins-2.141

            deepchip Martin d'Anjou added a comment - - edited

            With Jenkins 2.141, I ran the bash script and the python script, and there is no change. Jenkins still leaks processes, and the signals are still not trapped by the user script. There is one difference though: the first click on the terminate button (the red [x]) does not kill the job immediately, but that seems to change nothing.

            oleg_nenashev Oleg Nenashev added a comment -

            Martin d'Anjou Leaking of processes is unrelated to this fix.

            Usual causes:

            • You use 32bit-Java on a 64bit machine
            • You use tool wrappers like Cygwin which mess up the process tree in Windows (See the Cygwin Process Killer plugin)
            • The processes are spawned without inheriting Build reference variables, so the library cannot pick them up if parent processes are already aborted, and the process is orphaned

            I suggest creating a separate issue if none of the above is your case

             

             

             

            vmagana Victor Magana added a comment -

            Hello, I'm seeing an error in the hudson.util.ProcessTree logger: "External Ctrl+C execution failed for process pid=3872. Ctrl+C process exited with code -1073741515: Failed to attach to the console".  Is there any option or parameter that needs to be set for Jenkins to attach and send the Ctrl+C signal? I'm running Jenkins 2.150 on Windows 7 x64, with a Windows batch job on the local master that executes a Python script.  I also ran it as an Execute Python Script job and got the same error.  Thanks for any help.

             

            Failed to send CTRL+C to pid=3872
            org.jvnet.winp.WinpException: External Ctrl+C execution failed for process pid=3872. Ctrl+C process exited with code -1073741515: Failed to attach to the console (see the AttachConsole WinAPI call). error=0 at winp.cpp:59

            at org.jvnet.winp.Native.sendCtrlC(Native Method)
            at org.jvnet.winp.Native.sendCtrlC(Native.java:90)
            at org.jvnet.winp.WinProcess.sendCtrlC(WinProcess.java:93)
            at hudson.util.ProcessTree$WindowsOSProcess.killSoftly(ProcessTree.java:538)
            at hudson.util.ProcessTree$WindowsOSProcess.killRecursively(ProcessTree.java:517)
            at hudson.util.ProcessTree.killAll(ProcessTree.java:168)
            at hudson.Proc$LocalProc.destroy(Proc.java:384)
            at hudson.Proc$LocalProc.join(Proc.java:357)
            at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
            at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
            at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
            at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
            at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
            at hudson.model.Build$BuildExecution.build(Build.java:206)
            at hudson.model.Build$BuildExecution.doRun(Build.java:163)
            at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
            at hudson.model.Run.execute(Run.java:1810)
            at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
            at hudson.model.ResourceController.execute(ResourceController.java:97)
            at hudson.model.Executor.run(Executor.java:429)
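
            The negative exit code in the message above is easier to look up once it is converted to the unsigned 32-bit hexadecimal form that Windows documentation uses. A small sketch of that arithmetic, assuming a shell with 64-bit integer arithmetic such as bash (the function name to_ntstatus is made up for illustration):

```shell
#!/bin/bash
# Hypothetical helper: print a signed 32-bit Windows exit code as
# unsigned hexadecimal by masking it to the low 32 bits.
to_ntstatus() {
    printf '0x%08X\n' $(( $1 & 0xFFFFFFFF ))
}

to_ntstatus -1073741515   # prints 0xC0000135
```

            0xC0000135 is the NTSTATUS value STATUS_DLL_NOT_FOUND, which would suggest that the helper process used to send Ctrl+C could not load a DLL it needs, rather than that a Jenkins option is missing. That is a reading of the exit code only and has not been confirmed in this issue.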

            deepchip Martin d'Anjou added a comment - - edited

            The TERM signal is trapped by the freestyle script when the job runs on the Jenkins master, but when it runs on a node, the signal is not received (or not sent?).

            karlparry Karl Parry made changes -
            Comment [ Final Edit - our issue was identified not related to the TERM kill but a rogue jenkins triggering a duplicate job

              ]
            batmat Baptiste Mathus made changes -
            Summary gracefull job termination graceful job termination
            danielbeck Daniel Beck made changes -
            Link This issue is related to JENKINS-55106 [ JENKINS-55106 ]
            osmith42 Oliver Smith added a comment -

            I have created the following demo script:

            #!/bin/sh -ex
            
            trap cleanup "TERM"
            set +x
            
            cleanup() {
            	echo "Caught signal, cleaning up..."
            	exit 1
            }
            
            echo "Sleeping..."
            
            while true; do
            	sleep 0.1
            done
            
            # should not get here due to while true
            echo "EOF"
            

            When running in a terminal without jenkins, it catches the signal as expected (e.g. with "pkill -TERM trapscript.sh"):

            $ ./trapscript.sh
            + trap cleanup TERM
            + set +x
            Sleeping...
            Caught signal, cleaning up...
            

            On Jenkins 2.150.2, it does not run the cleanup function:

            [TEST_trap_in_jenkins_job] $ /bin/sh -ex /tmp/jenkins5365212366501463498.sh
            + trap cleanup TERM
            + set +x
            Sleeping...
            Build was aborted
            Aborted by Oliver Smith
            Terminated
            Finished: ABORTED
            

            The server is configured to run all jobs on nodes, so this might be the same problem that Martin d'Anjou pointed out above:
            when it runs on a node, the signal is not received (or not sent?).

            It would be great if somebody could look into this, thanks!
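
            To compare behaviour on different machines without clicking through the Jenkins UI, the same check can be made self-verifying. The sketch below is an assumption-laden variant of the demo, not part of the original report: the function names worker and check_trap are made up, the worker runs as a background subshell instead of a separate script file, and the one-second delay before signalling is arbitrary.

```shell
#!/bin/sh
# Sketch: start a worker with a TERM trap, signal it, and report
# whether the trap handler actually ran.

# Same idea as the demo script: loop forever, clean up on SIGTERM.
worker() {
    trap 'echo "Caught signal, cleaning up..."; exit 1' TERM
    while true; do
        sleep 0.1
    done
}

# Returns 0 if the worker's handler ran (exit status 1 plus the
# cleanup message), 1 otherwise.
check_trap() {
    out=$(mktemp)
    ( worker ) >"$out" 2>&1 &
    pid=$!
    sleep 1                      # give the worker time to install its trap
    kill -TERM "$pid"
    status=0
    wait "$pid" || status=$?
    if [ "$status" -eq 1 ] && grep -q "Caught signal" "$out"; then
        result=0
    else
        result=1
    fi
    rm -f "$out"
    return "$result"
}
```

            Running check_trap in a terminal and inside a job on an agent only shows whether traps work in each environment; it does not exercise the Jenkins abort path itself.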

            deepchip Martin d'Anjou added a comment -

            Pham Vu Tuan do you know how we could debug the communication between master and agents? It seems like the unix kill signal is not sent or received by the agent.

            owenmehegan Owen Mehegan made changes -
            Assignee Robin Jarry [ robinjarry ]
            owenmehegan Owen Mehegan added a comment -

            Martin d'Anjou possibly a question for Jeff Thompson.

            slonopotamusorama Marat Radchenko made changes -
            Link This issue relates to JENKINS-59152 [ JENKINS-59152 ]

              People

              • Assignee:
                Unassigned
                Reporter:
                markusb Markus Breuer
              • Votes:
                38
                Watchers:
                52

                Dates

                • Created:
                  Updated: