JENKINS-59152

Jenkins fails to properly abort "bat" step

    Details

    • Released As:
      Jenkins 2.199

      Description

      1. Windows
      2. Jenkins 2.176.1
      3. Create pipeline:
        node() {
          bat "ping 127.0.0.1 -n 100000"
        }
        
      4. Run pipeline
      5. Abort pipeline
      6. View build log

      Expected: the pipeline aborts quickly and without errors.

      Actual (not reproducible on every run):

      1. The pipeline takes 20s to abort.
      2. The build log contains "Click here to forcibly terminate running steps" and "After 20s process did not stop", indicating that Jenkins has trouble stopping the pipeline.
      3. The "Click here to forcibly terminate running steps" link remains visible even after the build has finished.
      4. Sometimes ping processes are NOT terminated even though the build has aborted.

      Issue analysis:

      1. There is a race condition between the 2-minute timer in hudson.util.ProcessTree.WindowsOSProcess#killSoftly, introduced for JENKINS-17116 by PR#3414, and the 20s timer in org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.Execution#stop. DurableTaskStep can report the step as cancelled while in fact the process is still running. Because of this race condition, Jenkins can be tricked into thinking that the build has finished while in fact processes are still running in the workspace and potentially locking files there (this happens to us in practice).
      2. org.jvnet.winp.WinProcess#sendCtrlC, which is used in hudson.util.ProcessTree.WindowsOSProcess#killSoftly, is NOT a proper way to terminate processes. Many apps do not interpret CTRL+C as a shutdown signal. cmd.exe is the most important one here, because running bat in a pipeline involves TWO cmd.exe processes: one running jenkins-wrapper.bat and a second running jenkins-main.bat. Why not use the TerminateProcess function from WinAPI?
      3. There is a race condition between gathering the process list in the hudson.util.ProcessTree.Windows#Windows constructor and killing the processes. In that window the build can spawn new processes, and no attempt is made to kill them.
      4. Using JENKINS_NODE_COOKIE to find which processes to kill is unreliable because 1) processes are free to alter their environment, 2) CreateProcessA allows passing custom environment variables, 3) it has an unpredictable order, and 4) it doesn't match Jenkins behavior on Linux.
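
To make points 2 to 4 concrete, here is a minimal sketch of a forced tree kill that walks parent/child relationships instead of matching an environment cookie, and re-scans on every pass so that children spawned after the first snapshot are still picked up. It is not how Jenkins or winp implement process killing; the class and method names (ProcessTreeKiller, killTree) are hypothetical, and it assumes Java 9+ for the ProcessHandle API. As far as I know, the JDK implements destroyForcibly() on Windows via TerminateProcess, but treat that as an assumption.

```java
import java.time.Duration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: forcibly kills a process and everything it spawned. */
public final class ProcessTreeKiller {
    private ProcessTreeKiller() {}

    /**
     * Kills root and its descendants, re-scanning until every process seen so far
     * is dead or the timeout elapses.
     *
     * @return true if all processes seen were dead before the timeout
     */
    public static boolean killTree(ProcessHandle root, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        Set<ProcessHandle> seen = new LinkedHashSet<>();
        seen.add(root);
        while (true) {
            List<ProcessHandle> alive = seen.stream()
                    .filter(ProcessHandle::isAlive)
                    .collect(Collectors.toList());
            if (alive.isEmpty()) {
                return true;   // nothing we know about is still running
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;  // report failure instead of pretending success
            }
            // Re-scan before killing: once a parent is gone, its children can
            // no longer be discovered through it.
            for (ProcessHandle h : alive) {
                h.descendants().forEach(seen::add);
            }
            // No soft kill and no CTRL+C: terminate unconditionally.
            for (ProcessHandle h : seen) {
                if (h.isAlive()) {
                    h.destroyForcibly();
                }
            }
            Thread.sleep(50);  // give the OS a moment, then verify on the next pass
        }
    }
}
```

A caller holding a java.lang.Process would invoke it as ProcessTreeKiller.killTree(process.toHandle(), Duration.ofSeconds(10)) and treat a false result as "still running" rather than as a finished abort. The re-scan narrows the window described in point 3 but does not close it: a child spawned between the scan and the kill of its parent is orphaned and missed. Closing it fully would likely need an OS-level container such as a Windows job object, which plain Java does not expose.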

            Activity

            slonopotamusorama Marat Radchenko added a comment -

            Everything I described in this issue can be reproduced by running the org.jenkinsci.plugins.workflow.steps.durable_task.ShellStepTest#abort test on Windows. Sometimes it passes quickly. Sometimes it idles until the 20s timeout. Sometimes it fails to kill the ping process.

            slonopotamusorama Marat Radchenko added a comment -

            See the comments on PR#4216 for additional technical analysis.

            oleg_nenashev Oleg Nenashev added a comment -

            The fix was released in Jenkins 2.199

            slonopotamusorama Marat Radchenko added a comment -

            I do not agree that PR#4225 fully fixed this issue. The race conditions between multiple timers are still there. Shortening the soft-kill timeout makes the issue less frequent but still possible.
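
The arithmetic behind this objection can be sketched as a toy worst-case model. This is not Jenkins code: the class name AbortTimerModel, the killOverhead parameter (time spent enumerating and terminating the process tree), and the shortened 5s soft-kill value are all hypothetical. Only the 2-minute and 20s figures come from this issue.

```java
import java.time.Duration;

/** Toy model of two independent timers racing during an abort. */
public final class AbortTimerModel {
    private AbortTimerModel() {}

    /**
     * Worst case: the process ignores CTRL+C, so it is gone only after the
     * soft-kill wait plus the time the hard kill itself takes (hypothetical).
     *
     * @return true if the step is reported aborted before the process is gone
     */
    public static boolean abortReportedTooEarly(
            Duration softKillWait, Duration killOverhead, Duration stepGrace) {
        Duration processGone = softKillWait.plus(killOverhead);
        return processGone.compareTo(stepGrace) > 0;
    }

    public static void main(String[] args) {
        Duration stepGrace = Duration.ofSeconds(20);  // 20s timer, from the issue
        Duration original = Duration.ofMinutes(2);    // 2-minute timer, from the issue
        Duration shortened = Duration.ofSeconds(5);   // hypothetical shortened value

        // Original timers: the step always gives up first.
        System.out.println(abortReportedTooEarly(
                original, Duration.ofSeconds(1), stepGrace));   // true
        // Shortened soft kill, fast hard kill: no race.
        System.out.println(abortReportedTooEarly(
                shortened, Duration.ofSeconds(1), stepGrace));  // false
        // Shortened soft kill, slow hard kill (loaded agent): race is back.
        System.out.println(abortReportedTooEarly(
                shortened, Duration.ofSeconds(16), stepGrace)); // true
    }
}
```

Under these assumptions a shorter soft-kill wait only widens the safety margin. As long as the two timers run independently, any sufficiently slow kill still lets the step report success first.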

            olivergondza Oliver Gondža added a comment -

            Given the fix is disputed (and far from trivial, IMO), I am postponing the backport to 2.190.3 at least.


              People

              • Assignee:
                Unassigned
                Reporter:
                slonopotamusorama Marat Radchenko
