Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-39072

timeout step should include more logging/diagnostics information

    Details

    • Similar Issues:

      Description

      While trying to diagnose the failure of the timeout step to terminate a process (see process tree, the Pipeline was awaiting completion of pid 1031) with Jesse Glick we determined that the timeout step doesn't include sufficient information in the logs of the Pipeline Thread Dump to be terrifically useful when things go awry.

      See the following chat log:

      16:39 < jglick> rtyler: so that was a normal termination via `SIGTERM` (143 = 128 + 15; cf. `kill -l`). Not sure why `timeout` did not do it. No Jenkins restart that I can see. Could try to add more status information to `timeout` indicating in virtual thread dump (a) whether it ever delivered a cancellation, (b) whether its scheduled task is still there.
      16:40 <@rtyler> jglick: should adding more verbiage to timeout be among the tickets I should file/
      16:40 < jglick> rtyler: that would be in `workflow-basic-steps-plugin`; `sh` is in `workflow-durable-task-step-plugin`
      16:41 < jglick> rtyler: verbiage in the log, but also information in virtual thread dump (currently it does not override `getStatus`)
      

      The more information the timeout step can include in both the Console Output and the Thread Dump, the better.

      Potentially related to JENKINS-34637

        Attachments

          Issue Links

            Activity

            Hide
            jglick Jesse Glick added a comment -

            As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order.

            And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

            Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.

            Show
            jglick Jesse Glick added a comment - As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order. And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40
            Log:
            JENKINS-39072 Print log messages about what is happening with timeouts.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40 Log: JENKINS-39072 Print log messages about what is happening with timeouts.
            Hide
            jglick Jesse Glick added a comment -

            if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

            Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running.

            In case the problem is just an unresponsive StepExecution.stop, I think we can kill it. There will be other cases that are not handled.

            Show
            jglick Jesse Glick added a comment - if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running. In case the problem is just an unresponsive StepExecution.stop , I think we can kill it. There will be other cases that are not handled.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f
            Log:
            JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm.
            Does not suffice to kill the body in all cases.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f Log: JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm. Does not suffice to kill the body in all cases.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2
            Log:
            [FIXED JENKINS-39072] Include status information for a running timeout step.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2 Log: [FIXED JENKINS-39072] Include status information for a running timeout step.

              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                rtyler R. Tyler Croy
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: