JENKINS-26130: Print progress of pending pickles

      Description

      After restarting Jenkins with a running flow that has some "pickled" object references (such as slave/workspace pairs from the node step), the flow does not resume until all pickles are resolved. This delay could be long, and the user may have no idea what is happening, because nothing is shown in the console.

            Activity

            jglick Jesse Glick added a comment -

            The lack of this can be a serious problem, since neither the build log nor the thread dump gives any information about what is (not) happening. I am seeing random failures such as:

            "Executing resumeTwice(com.cloudbees.workflow.cps.checkpoint.CheckpointTest)" #1 … in Object.wait() […]
               java.lang.Thread.State: WAITING (on object monitor)
            	at java.lang.Object.wait(Native Method)
            	at java.lang.Object.wait(Object.java:502)
            	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
            	- locked <0x…> (a hudson.model.queue.FutureImpl)
            	at java_util_concurrent_Future$get.call(Unknown Source)
            	at com.cloudbees.workflow.cps.checkpoint.CheckpointTest$_resumeTwice_closure5.doCall(CheckpointTest.groovy:181)
            

            with no diagnostic information available.
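
            The thread dump above shows a bare Future.get with nothing to say what it is waiting for. As a rough illustration of the requested behavior (not the plugin's actual code), the wait could poll with a timeout and print a description of the pending item on each pass. The class PendingPickleWaiter, its method, and the description string are all hypothetical names made up for this sketch; only java.util.concurrent is used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical helper for illustration; not part of any Jenkins plugin API. */
class PendingPickleWaiter {

    /**
     * Waits for a future in short polls, logging what is still pending after each
     * poll that times out. The description stands in for a "printable" pickle future.
     */
    static <T> T awaitWithProgress(Future<T> future, String description,
                                   long pollMillis, int maxPolls, Consumer<String> log)
            throws InterruptedException, ExecutionException, TimeoutException {
        for (int attempt = 1; attempt <= maxPolls; attempt++) {
            try {
                return future.get(pollMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Not ready yet: say so instead of blocking silently.
                log.accept("Still waiting for " + description + " (poll " + attempt + ")");
            }
        }
        throw new TimeoutException("Gave up waiting for " + description);
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes, standing in for an agent that never returns.
        Future<String> neverReady = new CompletableFuture<>();
        try {
            awaitWithProgress(neverReady, "agent-1 workspace", 50, 2, System.out::println);
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

            In a real build the log consumer would presumably write to the build console, so the user sees what the flow is blocked on.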

            jglick Jesse Glick added a comment -

            Need to test the common scenario that a node block cannot resume because its slave was ephemeral. Can we recover with a hard kill?

            jglick Jesse Glick added a comment -

            A hard kill from JENKINS-25550 works around it. Still, it should behave better: it should show what it is trying to resume, and a regular interruption should stop that. Currently you are given no information, and have to escalate from a regular interrupt, to a step termination, to a hard kill, and then also separately kill the unschedulable queue item.

            By the design of pickle resolution, we cannot really recover the build, unfortunately. I argued during the initial design of Workflow that the node step’s state should include just the slave name and workspace path, and that its onResume should be responsible for trying to get that workspace back (so that this step could handle stop gracefully, for example by throwing an exception that the script could catch and handle). Kohsuke Kawaguchi overrode me, insisting that the serialized program state should include a representation of the FilePath, and that script execution shall not resume until that pickle is successfully rehydrated (even if there were other branches able to proceed, etc.).

            So the best we can do is display more clearly what is wrong and offer a hard kill right away. With respect to the queue item, PlaceholderTask.run can tell via StepContext.isReady that it is still being unpickled, but it cannot use that to distinguish the case of a normal startup, when we are waiting for a sluggish slave to come back online. It cannot even find the Run to tell whether it was already aborted, since it cannot call get yet. Probably it will need to persist a Run.externalizableId to implement run, and also use that instead of accessControlled. If the Run turns out to be finished, it could use Queue.getItems(Task) to cancel itself, so that cleanup of the whole process reduces to pressing the stop button once on the console page or on the flyweight executor in the executor widget.
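
            The queue cleanup idea can be sketched without any Jenkins classes. In this toy model (every name here is hypothetical, and the run IDs merely imitate the job#number style), each placeholder item remembers only the ID of its build, and items whose build is already finished are dropped from the queue. It does not call Queue.getItems(Task) or any other real Jenkins API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Toy model of the proposed cleanup; all types are hypothetical stand-ins. */
class QueueCleanupSketch {

    /** A queue item that persists only the ID of the build it belongs to. */
    static final class PlaceholderItem {
        final String runId;
        PlaceholderItem(String runId) { this.runId = runId; }
    }

    /**
     * Removes every item whose build is known to be finished and
     * returns how many items were cancelled.
     */
    static int cancelItemsOfFinishedRuns(List<PlaceholderItem> queue, Set<String> finishedRunIds) {
        int cancelled = 0;
        for (Iterator<PlaceholderItem> it = queue.iterator(); it.hasNext();) {
            if (finishedRunIds.contains(it.next().runId)) {
                it.remove(); // the build is over, so the item can never be scheduled usefully
                cancelled++;
            }
        }
        return cancelled;
    }

    public static void main(String[] args) {
        List<PlaceholderItem> queue = new ArrayList<>(Arrays.asList(
                new PlaceholderItem("demo#1"), new PlaceholderItem("demo#2")));
        Set<String> finished = new HashSet<>(Arrays.asList("demo#1"));
        int cancelled = cancelItemsOfFinishedRuns(queue, finished);
        System.out.println("cancelled=" + cancelled + ", remaining=" + queue.size());
    }
}
```

            The point of persisting the ID is that the item can look up its build without first rehydrating the rest of the step context.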

            jglick Jesse Glick added a comment -

            It would be possible to differentiate between an EphemeralNode and a slave which is simply slow to come back online by checking whether the node is defined while it is offline: if so, continue to wait; if not, just abort right away. A slave using an inappropriate RetentionStrategy is trickier, since it might still be defined after a restart but will soon be killed. I suppose in that case it will be removed after a few minutes and the pickle can abort itself.
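
            A minimal sketch of that check, assuming the set of defined nodes can be viewed as a map from node name to an online flag. The ResumeDecision class, the Action enum, and the map are hypothetical stand-ins for illustration, not Jenkins types.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative decision logic only; the map stands in for the configured nodes. */
class ResumeDecision {

    enum Action { RESUME, KEEP_WAITING, ABORT }

    /**
     * @param definedNodes node name -> true if online, false if defined but offline
     * @param nodeName     the node the pickled workspace belonged to
     */
    static Action decide(Map<String, Boolean> definedNodes, String nodeName) {
        Boolean online = definedNodes.get(nodeName);
        if (online == null) {
            return Action.ABORT;        // node is no longer defined, e.g. it was ephemeral
        }
        return online ? Action.RESUME : Action.KEEP_WAITING; // offline but defined: may still return
    }

    public static void main(String[] args) {
        Map<String, Boolean> nodes = new HashMap<>();
        nodes.put("slow-agent", false);
        System.out.println(decide(nodes, "slow-agent"));
        System.out.println(decide(nodes, "ephemeral-agent"));
    }
}
```

            Re-running the check periodically would also cover the RetentionStrategy case: the answer changes from KEEP_WAITING to ABORT once the node has been removed.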

            rsandell rsandell added a comment -

            Interesting behaviour in 1.13

            After a restart it does now report that it can't reconnect to the node. So I aborted and after doing a Hard Kill it did abort, but it left some fragment in the Build Queue after it had "successfully" aborted and completed.
            Aborted by anonymous
            Resuming build
            [ath-oss-split-1] Could not connect to 4fd7ec39 to send interrupt signal to process
            Aborted by Robert Sandell
            Click here to forcibly terminate running steps
            Click here to forcibly kill entire build
            Hard kill!
            Finished: ABORTED

            And after this, the item (or some fragment of it) was still in the queue.

            jglick Jesse Glick added a comment -

            I have a rudimentary implementation. I have not yet tried to implement cancellability of pickle rehydration.

            jglick Jesse Glick added a comment -

            Implementation complete, except for the last suggestion to automatically abort an ExecutorPickle which is determined to be unloadable due to a deleted ephemeral node. This could be done later if desired. In the meantime, it is now much easier to identify and cancel builds affected by such issues.

            jglick Jesse Glick added a comment -

            Released as five plugin updates.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/cbd00d462ebfb8d320e87228f88baf6d6f0f90f3
            Log:
            JENKINS-26130 Way to print progress from pickles.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/d85f3c2d60f7d4a7f0f4e2687f6e31069b6d0f28
            Log:
            Merge pull request #5 from jglick/PPPP-JENKINS-26130

            JENKINS-26130 Way to print progress from pickles

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/3705e3f13c84ae5a29a1cba0884e5e67f54e7caa
            Log:
            JENKINS-26130 JENKINS-31842 Request that pickle futures be printable.


              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                jglick Jesse Glick