Jenkins / JENKINS-33761

Ability to disable Pipeline durability and "resume" build.


    Details

    • Sprint:
      Blue Ocean 1.4 - beta 2, Pipeline - December

      Description

Because state is generated on each node during execution, resuming builds after Jenkins restarts or node reboots is sometimes simply not feasible and can result in infinite hangs. Also, providing durability requires extensive writes to disk that can bring performance crashing down.

It would be great to be able to specify that jobs don't resume after interruptions but instead just fail. Ideally this would increase the robustness of the system: as things stand, when nodes restart they quickly pick up jobs that try to resume and hang, rapidly exhausting all available executors.

      Implementation notes:

      • Requires a new OptionalJobProperty on the job, optionally a new BranchProperty in workflow-multibranch-plugin that echoes that same property
      • Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
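A hypothetical sketch of how such a per-job opt-out plus a durability hint might surface in a Jenkinsfile; the option names `disableResume` and `durabilityHint` are illustrative placeholders for the proposal above, not an API that existed at the time this issue was filed:

```groovy
// Hypothetical Jenkinsfile sketch -- option names are illustrative.
pipeline {
    agent any
    options {
        // Per-job opt-out: fail the build on controller restart
        // instead of attempting to resume it.
        disableResume()
        // Hint to storage (workflow-support) and execution
        // (workflow-cps) that faster, nondurable execution is
        // acceptable for this job.
        durabilityHint('PERFORMANCE_OPTIMIZED')
    }
    stages {
        stage('Build') {
            steps { sh 'make' }
        }
    }
}
```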

        Attachments

          Issue Links

            Activity

            jtilander Jim Tilander created issue -
            jglick Jesse Glick added a comment -

            each time a job tries to resume it hangs infinitely

            If you can describe how to reproduce from scratch, I will try to fix it. Surviving Jenkins restarts is a key feature of Pipeline.

            jglick Jesse Glick made changes -
            Field Original Value New Value
            Epic Link JENKINS-35399 [ 171192 ]
            dodoent Nenad Miksa added a comment -

Well, in my case I have 7 parallel tasks running heavy shell scripts (the scripts run CMake builds and CTest tests). Resuming the build after a restart starts the scripts from the beginning (which is basically the same as starting the whole job anew), and the worst part is that although 7 parallel branches are being executed, all executors are free, making it possible for Jenkins to trigger another build and hog the server resources.

            larsmeynberg Lars Meynberg made changes -
            Link This issue is related to JENKINS-28183 [ JENKINS-28183 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 169740 ] JNJira + In-Review [ 183632 ]
            abayer Andrew Bayer made changes -
            Component/s pipeline-general [ 21692 ]
            abayer Andrew Bayer made changes -
            Component/s workflow-plugin [ 18820 ]
            jglick Jesse Glick added a comment -

            Resuming build after restart starts the scripts from the beginning

            Never heard of such a bug and cannot even imagine how it could occur. If you have steps to reproduce from scratch, please file separately.

            jglick Jesse Glick made changes -
            Component/s workflow-job-plugin [ 21716 ]
            Component/s pipeline [ 21692 ]
            davidkarlsen davidkarlsen added a comment - - edited

            Got same problem.

            Resuming build at Fri Sep 30 16:18:26 CEST 2016 after Jenkins restart
            Ready to run at Fri Sep 30 16:18:28 CEST 2016
            

            and then it just sits there.

            threaddump:

            Thread #6
            	at DSL.emailext(not yet scheduled)
            	at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:101)
            	at DSL.sshagent(Native Method)
            	at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:28)
            	at DSL.ws(Native Method)
            	at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:27)
            	at DSL.node(running on slave24-rhel)
            	at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:25)
            	at WorkflowScript.run(WorkflowScript:15)
            

            seems to be stuck in emailext

            groves Charlie Groves added a comment -

            I would also really appreciate the ability to disable resumption. We have a few builds where it doesn't make sense to resume them, so it'd be better to have it off completely.

            jglick Jesse Glick added a comment -

            a few builds where it doesn't make sense to resume them

            Because they intrinsically could not be resumed? Or you just do not really care about loss of a build or two?

            groves Charlie Groves added a comment -

            There are a couple cases:

            1. We run builds on ephemeral EC2 agents. If Jenkins is restarted, the agents are often dead by the time Jenkins is back. Those builds just hang looking for the agent.
            2. We run deploys that require someone to monitor them. We'd prefer that they not be restarted automatically, that instead someone be there to start and watch them.
            jglick Jesse Glick added a comment -

            We run builds on ephemeral EC2 agents. If Jenkins is restarted, the agents are often dead by the time Jenkins is back.

            If true, that is a bug in the EC2 plugin. It is supposed to keep the agent connected for the entire duration of the build.

            they not be restarted automatically

            Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).

            rg Russell Gallop added a comment - - edited

            I'd be happy with resume either working or there being an option to disable it.

            > I've also never ever seen it work correctly...

Likewise, but maybe it happens without my noticing.

            > Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).

            In the cases where we see resume hanging there is no process running so maybe the problem is with finding the process again, or handling it not being there.

            Should it handle bat() and sh()? Is the assumption that the slave keeps track of the process output, return value etc.? Will a job always resume to the same slave?

            groves Charlie Groves added a comment -

            If true, that is a bug in the EC2 plugin. It is supposed to keep the agent connected for the entire duration of the build.

            Maybe the build finished before Jenkins came back? It doesn't really change that we'd prefer not to resume these builds.

            Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).

            Oh, it was my understanding that it'd continue to the next step after that? Does it only complete the current step and stop?

            jglick Jesse Glick added a comment -

            maybe the problem is with finding the process again, or handling it not being there

            Perhaps.

            Should it handle bat() and sh()?

Yes, this is the principal use case.

            Is the assumption that the slave keeps track of the process output, return value etc.?

            Yes.

            Will a job always resume to the same slave?

            Yes.

            Maybe the build finished before Jenkins came back?

            The external process I suppose you mean. Should be fine, the sh/bat step should then simply print any final output, and exit according to the process’ exit code.

            it'd continue to the next step after that?

            Exactly.
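The resume mechanics described above can be illustrated with a minimal shell sketch of the durable-task pattern; the file names and wrapper here are illustrative assumptions, not the durable-task plugin's actual code:

```shell
# Illustrative sketch of the durable-task pattern (NOT the plugin's
# real implementation): the agent-side wrapper records the process
# output and exit code in files, so a restarted controller can
# reattach by reading the files rather than holding the live process.
WORKDIR=$(mktemp -d)

# The wrapper runs the user's script and records everything durably.
sh -c 'echo building; exit 3' > "$WORKDIR/log.txt" 2>&1
echo $? > "$WORKDIR/result.txt"

# A "resumed" controller needs only the files, not the original process:
cat "$WORKDIR/log.txt"               # replay output produced meanwhile
read -r code < "$WORKDIR/result.txt" # recover the exit status
echo "step exit code: $code"
```

This is also why a build must resume on the same agent: only that node's filesystem holds the recorded output and exit status.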

            groves Charlie Groves added a comment -

            Ahh, well what we want is for it not to continue without someone being there to monitor it. Automatically continuing to the next step means that if the person previously monitoring it left after Jenkins went down, they wouldn't be there when it comes back. We'd like to not have it resume for those builds.

            jglick Jesse Glick added a comment -

            I think what you are looking for is an input step between stages. Whether Jenkins happened to be restarted in the middle of the build is not relevant to that.

            groves Charlie Groves added a comment -

            Not really. While the build is running normally, we don't want anyone to have to confirm proceeding to the next step.

            It's only when the build is interrupted by a restart that we don't want it to automatically continue. Whenever Jenkins goes down, it's generally a larger failure with a variable amount of time before service is restored. We can't guarantee that the person watching it will be there when Jenkins comes back without making the normal case worse.

            dodoent Nenad Miksa added a comment - - edited

Jesse Glick, I am aware that resuming a build is one of the core features of Pipeline and that you would very much like it to work by default. However, in my experience most plugins do not properly support resuming a build (i.e. they have bugs, not that they deliberately refuse to support it). After restarting Jenkins (mostly due to plugin updates), I have seen: jobs resuming without taking any executors; jobs taking an executor and just waiting for something that never happens (i.e. waiting forever); jobs which redo work after being resumed that was already done; plugins failing to parse XML test reports after being "resumed" in the middle of the process; jobs that were restarted while performing network operations inside a shell script and were frozen after resume and could not be stopped in any way (neither with stop nor with kill; we had to manually remove the job from the database while Jenkins was offline to stop it from resuming); ...

Please be aware that Jenkins is also used by software developers who do not develop in Java (which is natively supported by Jenkins) and that we do some very weird things in our build scripts to get the behaviour and flexibility we need. For example, in my case I need to clone the repository on an SSD, while one of its submodules must be cloned on a rotational disk; such a use case will never be supported by default Jenkins SCM plugins, so I must write my own build script to do it, and improper/buggy resuming of such a script usually makes the executor wait indefinitely for something that never happens, so I must manually kill the build (yes, kill, because stop is also ignored).

To properly support such use cases there are two ways: either add support in every plugin for every use case (no matter how weird) and make it work correctly in all cases, including bug-free resumption of builds on a multi-node heterogeneous system, which is nearly impossible; or simply add a checkbox saying "disable resuming of build" which will either prevent Jenkins from being restarted while a build is ongoing (the behaviour of freestyle jobs) or simply fail the build. Yes, failing the build is not technically correct, but it is exactly what is currently happening for us, except that after resuming, an engineer needs to log into Jenkins and manually kill the zombie build, which only waits and never properly resumes.

            jglick Jesse Glick made changes -
            Component/s workflow-cps-plugin [ 21713 ]
            jglick Jesse Glick made changes -
            Issue Type Improvement [ 4 ] New Feature [ 2 ]
            jglick Jesse Glick added a comment -

            jobs that were restarted during perfoming network operations inside shell script which were frozen after resume and could not be stopped in any way

            If this is reproducible somehow I would consider it a high-priority bug to be fixed. Ditto for the other scenarios, unless they are limited to usage of some particular plugin.

            simply add this simple checkbox

            Adding a checkbox is of course simple. Making it do what you request is not necessarily simple and would require significant study. CpsThreadGroup.saveProgram can easily be suppressed (possibly the CPS transform could even be disabled), and WorkflowRun.onLoad could easily be made to fail a build which had not terminated cleanly, and CpsFlowExecution.blocksRestart could be made to unconditionally return true. But then you will still get unreleased workspace locks and the like. Possibly there is some way a Terminator could throw ThreadDeath into the Groovy call stack to try to unwind blocks cleanly.

            dodoent Nenad Miksa added a comment - - edited

Jesse Glick, unfortunately the bug is not deterministic: usually after a restart jobs just hang, but they can be killed (with kill, of course; stop rarely works).

The shell script which hung in such a way that, after resume, the job could not be killed at all is this:

                    sh "curl -s -X POST https://bitbucket.org/site/oauth2/access_token -u \"${getBitbucketOAuthKey()}:${getBitbucketOAuthSecret()}\" -d grant_type=client_credentials | jsawk 'return this.access_token' | tr -d \"\\n\" > accessToken.txt"
            

The script obtains an access token for the Bitbucket API so it can be used later for notifying commit statuses and approving pull requests - something the Bitbucket Branch Source plugin did not support (they later added support for that, but it is not configurable enough to give us the flexibility we need). I cannot guarantee that this will trigger the bug, since this shell script is executed for every build we have, and only 3 jobs (out of dozens daily) were executing it at the time of the Jenkins restart, which caused them to lock up in a way that even kill didn't work.

However, it would be better to just fix resuming of jobs - since the original bug report, I have seen many improvements in this area (Unix shell scripts now rarely hang after resume, but Windows batch scripts almost always do). As I said, the most problematic are Windows batch scripts running CMake-based builds of Visual Studio C++ projects (CMake is used to create the Visual Studio solution and then 'cmake --build . --config Release' is used to invoke MSBuild to build the project). When a restart is triggered (on the master node, which is a Linux box) while such a build is executing on a Windows slave, the batch script is first terminated (I guess with some kind of interrupt signal), which causes MSVC to report the build as failed (MSVC reports cancelled builds as failures). After the restart the batch script is resumed, but instead of a new MSBuild invocation, a new call to the entire batch script (which would build the project correctly), or continuing with the next batch script (which collects test results and stashes them so the master node can later use the xUnit publisher plugin to publish them), the job simply hangs and does nothing indefinitely (until someone logs in and kills it, because the stop command is also ignored).

            matthall Matthew Hall added a comment -

Hello, I have recently also come across the bug of jobs not restarting. I can provide a test case to help with the investigation; three jobs are required:

            Job 1 will trigger job_40_sec and job_50_sec in parallel

If Jenkins restarts or is killed while job_40_sec and job_50_sec are both running, then when Jenkins comes back online only one of the jobs is resumed while the other hangs indefinitely.

            Please let me know if you need any more information or if this is the wrong place for this information

            Pipeline scripts:

            Job 1

            Map parallel_jobs = ['branch_1': {build job: 'job_50_sec'},
                                 'branch_2': {build job: 'job_40_sec'}]
            parallel parallel_jobs

            job_40_sec

            node { sleep(40) }

            job_50_sec

            node { sleep(50) }
            rn R N added a comment -

WRT JENKINS-41916, it would be good if the resume-build option could be disabled, as it does not respect security.

            jglick Jesse Glick added a comment -

            FTR

            I have recently also come across the bug of jobs not restarting

            This was filed separately and a fix released.

            jglick Jesse Glick made changes -
            Link This issue is duplicated by JENKINS-37475 [ JENKINS-37475 ]
            khalilj Khalil Jiries added a comment -

Hi,

+1 for an option to disable pipeline resumption.

            Thanks!

            jglick Jesse Glick made changes -
            Link This issue relates to JENKINS-36013 [ JENKINS-36013 ]
            jeanmertz Jean Mertz added a comment -

            We use the jenkins-kubernetes plugin and it also does not resume as expected.

Even if it did, we strive to have jobs take less than a couple of minutes, so we don't care if jobs don't resume after a restart. Having the ability to choose between "resume job", "restart job", or "discard job" would be a nice feature. We'd probably use the "restart" functionality, although I can also see use cases for discarding the job in its entirety.

            shahmishal mishal shah added a comment - - edited

            We should be able to disable auto-resume of jobs on Jenkins; it causes lots of jobs to hang for hours before we have to manually kill the builds. Also, sometimes we don't get any notification about a build being stuck for days because it could not resume.

            jayv Jo Voordeckers made changes -
            Attachment Screen Shot 2017-07-19 at 11.14.33 AM.png [ 38963 ]
            jayv Jo Voordeckers made changes -
            Attachment Screen Shot 2017-07-19 at 11.14.33 AM.png [ 38963 ]
            maxfields2000 Maxfield Stewart added a comment - - edited

            Definitely need a resume-disable capability. I'd love a global option, but a per-job option is also not a crime. There are times when you know your workspace and what your job is doing will not be viable for resumption even if Jenkins can, in theory, resume your build. The resumes will just fail, or worse, lead to red-herring issues. Yes, people should create stateless build steps with always-clean thinking, but that's just not reality at scale. Having a disable-resume option would be handy.

             

            Second, if you are working in a situation where your workspace or build nodes are ephemeral, resume literally breaks Jenkins. The resume feature locks to the precise name of the compute/build slave. In an ephemeral setup (think mesos/kubernetes/docker plugins, or jclouds and dynamically named slaves), when Jenkins is restarted that build slave no longer exists. The job sits in the queue waiting for the slave to come online, but it never does. Because the resume uses the slave name, not the "label" tied to the actual job, the act of being in the queue never triggers the dynamic provisioner (jclouds) to create a new slave, and the job hangs indefinitely.

             

            We see evidence that these resume states can also cause thread locking on the build queue itself, which then prevents any jobs from queuing at all. We have to go through quite the arduous process of manually cleaning the Jenkins file system to prevent builds from requeuing.

            If you won't provide a disable-resume feature, then at least tell us the logic for how Jenkins decides which jobs need to be resumed so we can properly clean up the markers Jenkins looks for on the file system to tell it to requeue.

            It seems to be a combination of Jenkins home XML files as well as some file state inside the Jobs folder (jobs/JobName/builds/#/workflow/....), but I don't know exactly what.

             Another option could be to wrap "node" blocks with "resume from here" rather than resuming from inside a node block. I suppose we could try putting the "node" call inside a function marked with @NonCPS, but that seems extreme, may have unexpected results, and I doubt I could get all my users to follow that convention anyway.

             

            oleg_nenashev Oleg Nenashev made changes -
            Link This issue is related to JENKINS-45917 [ JENKINS-45917 ]
            jglick Jesse Glick made changes -
            Assignee Jesse Glick [ jglick ]
            jglick Jesse Glick added a comment -

            Please stop adding +1 comments. You may use the voting feature in JIRA.

            build nodes are ephemeral

            Already tracked as JENKINS-36013.

            michaelneale Michael Neale made changes -
            Assignee Sam Van Oort [ svanoort ]
            svanoort Sam Van Oort made changes -
            Summary Ability to disable "resume" build. Ability to disable Pipeline durability and "resume" build.
            svanoort Sam Van Oort made changes -
            Description Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes. I've also never ever seen it work correctly, each time a job tries to resume it hangs infinitely and I always have to go in and kill the job manually.

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.
            Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.
            svanoort Sam Van Oort added a comment -

            Maxfield Stewart The resume issue with missing build agents was fixed as of workflow-durable-task-step 2.15 –  patched a week or so ago.  The request for NON-durable pipeline came up at Jenkins World too, although with a different motivation (performance). 

            I think this ties into work happening now on how we store data for pipelines and their logs that I'm launching into shortly (larger project though, will take a while to land).  

            As Jesse says, the failure to resume is generally a specific bug of some sort and needs to be addressed.

            michaelneale Michael Neale made changes -
            Component/s blueocean-plugin [ 21481 ]
            michaelneale Michael Neale made changes -
            Sprint Blue Ocean 1.3 - candidates [ 326 ]
            svanoort Sam Van Oort made changes -
            Epic Link JENKINS-35399 [ 171192 ] JENKINS-47170 [ 185575 ]
            svanoort Sam Van Oort made changes -
            Component/s blueocean-plugin [ 21481 ]
            svanoort Sam Van Oort added a comment -

            I'm attaching this to the storage epic because what I have in mind will also let you use this "unsafe" mode for gigantic performance gains. 

            svanoort Sam Van Oort made changes -
            Link This issue is related to JENKINS-47173 [ JENKINS-47173 ]
            svanoort Sam Van Oort made changes -
            Description Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.
            Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:

            * Requires a new OptionalJobProperty on the job, plus a new BranchProperty in branch-api-plugin that echoes that
            svanoort Sam Van Oort made changes -
            Description Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:

            * Requires a new OptionalJobProperty on the job, plus a new BranchProperty in branch-api-plugin that echoes that
            Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:
             * Requires a new OptionalJobProperty on the job, plus a new BranchProperty in branch-api-plugin that echoes that
             * Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
            svanoort Sam Van Oort made changes -
            Description Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:
             * Requires a new OptionalJobProperty on the job, plus a new BranchProperty in branch-api-plugin that echoes that
             * Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
            Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:
             * Requires a new OptionalJobProperty on the job, optionally a new BranchProperty in workflow-multibranch-plugin that echoes that same property
             * Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
            svanoort Sam Van Oort made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            svanoort Sam Van Oort made changes -
            Component/s workflow-api-plugin [ 21711 ]
            svanoort Sam Van Oort made changes -
            Description Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing done. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:
             * Requires a new OptionalJobProperty on the job, optionally a new BranchProperty in workflow-multibranch-plugin that echoes that same property
             * Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
            Having some state being generated at the each node during execution, resuming builds after jenkins restarts or nodes reboots are just not feasible sometimes and can result in infinite hangs in some cases.  Also, providing durability results in extensive writes to disk that can bring performance crashing down. 

            It would be great to be able to specify that jobs don't resume upon interruptions, but rather just fail. This would increase the robustness of the system ideally, since upon nodes restarting, they quickly pick up jobs that tries to resume and hangs exhausting all available executors quickly.

            Implementation notes:
             * Requires a new OptionalJobProperty on the job, optionally a new BranchProperty in workflow-multibranch-plugin that echoes that same property
             * Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
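            To make the implementation notes above concrete, here is a hypothetical scripted-Pipeline usage sketch. The `disableResume()` property name is a placeholder for the proposed OptionalJobProperty, not a confirmed API at the time of this discussion:

```groovy
// Hypothetical Jenkinsfile opting out of durability and resume.
// 'disableResume' is an illustrative name for the OptionalJobProperty
// described in the implementation notes, not a released API.
properties([
    disableResume()  // signal: fail on interruption instead of resuming
])

node('linux') {
    // With resume off, workflow-support/workflow-cps could skip the
    // per-step disk writes that durability normally requires.
    sh 'make test'
}
```

            The same flag would let a multibranch project echo the property via the proposed BranchProperty in workflow-multibranch-plugin.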
            jglick Jesse Glick added a comment -

            In such a mode you might as well also switch DurableTaskStep to be, well, nondurable, and thus to just use Launcher synchronously to run processes. Would probably require some API massaging. Would lose durability across transient agent disconnections as well as Jenkins restarts—i.e., same as traditional builds.

            jglick Jesse Glick made changes -
            Component/s workflow-durable-task-step-plugin [ 21715 ]
            svanoort Sam Van Oort made changes -
            Link This issue is related to JENKINS-47390 [ JENKINS-47390 ]
            svanoort Sam Van Oort added a comment -

            Jesse Glick Thanks for mentioning that.  To capture an out-of-band discussion, this has value to improve reliability.  I've forked that off into JENKINS-47390  because it is a nice work item on its own with clearly defined boundaries, and to avoid expanding the scope of this item too much. 

            svanoort Sam Van Oort made changes -
            Component/s workflow-durable-task-step-plugin [ 21715 ]
            sbeckwithiii Sam Beckwith III added a comment -

            I for one would very much like the option to disable durability in pipelines. I prefer to fail early and fail fast in our environment. Pipelines with the ability to resume are fantastic in many situations, but resuming in our environment continues to cause issues and headaches. At this point I am looking for ways to stop using pipelines yet keep the ability to dynamically choose nodes (the only reason we use pipelines at this point is the ability to programmatically choose nodes).

            It is far better for our automation to fail and retry again from the start or generate a report, therefore I heartily support the notion of making durability an option in pipelines rather than a requirement.

            svanoort Sam Van Oort added a comment -

            Sam Beckwith III I think you'll be happy with this when it lands - I'm hoping within the next few weeks, but take that with a grain of salt because it's part of a larger effort that brings a lot of useful features to permit pipeline to run faster and reduce the load it puts on masters.

            I'm sorry you've had so many issues with resume though – if you wouldn't mind, could you tell us what problems you've had? We've done a fair bit of work recently to resolve resume-time issues, so it's quite likely some of these have been fixed, and if not we'd like to ensure that it is robust. Thanks!

            sbeckwithiii Sam Beckwith III added a comment -

            My previous comment could be taken as a complaint or a "I expect this and deserve this" which was not my intention. We are very pleased with Jenkins as a whole. I'm surprised by how much we're able to do with it.

            Sam Van Oort, thank you for getting back to us. We have tasted some of the issues listed above, including the zombie process that just would not die, as well as plugins that could use some improvement, resulting in interesting indefinite-wait situations like the above-mentioned kubernetes plugin. My words were not meant to convey that the resumability feature of pipelines is useless or that y'all have been wasting time, but rather that I agree with others here that not every team or circumstance benefits from the resume feature, and some are actually hindered by it.

            If it is useful, I can clarify why I said, "the only reason we use pipelines at this point". Why wouldn't we joyfully jump on board with pipelines and be all in? It doesn't align with the framework we've built nor does it fit the direction we've been going (wrap up automation into nice, neat little "modules" or function calls that do not require the team to learn Jenkins, Job DSL, Pipelines or Groovy), however dynamically choosing nodes based on user input is a massive boon to us in some circumstances. Where we use pipelines we must do more work to accomplish our goals and lose almost all the features of our framework but that's a choice we make. The loss is far greater to us whenever we run into the above mentioned issues with pipelines.

            Not sure where to put this but it is worth noting that we use scripted pipelines, and do so out of necessity and design even though declarative has desirable niceties like post build steps. I didn't realize how nice it was for Jenkins to determine if a build was unstable for me until I had to write the code myself.

            svanoort Sam Van Oort added a comment -

            Sam Beckwith III No worries, no offense taken at what you said – we know that pipeline isn't perfect and just want to improve it over time.

            nor does it fit the direction we've been going (wrap up automation into nice, neat little "modules" or function calls that do not require the team to learn Jenkins, Job DSL, Pipelines or Groovy),

            This is what Pipeline Shared Libraries are intended to offer – in ci.jenkins.io, building a plugin is as simple as a JenkinsFile containing nothing but "buildPlugin()" in the repo.  But I'm guessing you've invested in building a framework around specific business needs and moving over to pipeline represents a loss of that invested effort + something not as closely aligned to your specific needs?

            > We have tasted some of the issues listed above including the zombie process that just would not die

            Three key causes of this are resolved in the most recent round of pipeline plugin updates (specifically: waiting for a throwaway node that will never reappear, waiting for a disconnected node to respond, and issues with stop operations on steps).

            So, you might find that an update to the plugins resolves the issue (if not, we'd really love to see an issue filed for it so we can put it to rest for good, because that represents a clear bug).

            But anyway, even aside from specific bugs, I think there's a recognition that automatic resume just plain may not make sense for every case... and softening that requirement for pipelines opens up a ton of opportunities.

            sbeckwithiii Sam Beckwith III added a comment -

            You are very encouraging, Sam Van Oort.

            I'm looking at the following change in an effort to work around the issue we currently are facing. Thank you.

            2.14 (Aug 23, 2017)
             * JENKINS-36013 - Prevent Jenkins from spinning indefinitely trying to resume a build where the agent is an EphemeralNode and will never come back
             * Also covers cases where the node was removed by RetentionPolicy because it is destroyed, by aborting after a timeout (5 minutes by default)
             * This ONLY happens if the node is removed, not for simply disconnected nodes, and is only triggered upon restart of the master
             * Added system property 'org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle.timeOutForNodeMillis' for how long to wait before aborting builds
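            For reference, that system property is a JVM flag supplied when the Jenkins master is started; the 600000 value below is just an example override of the 5-minute default, not a recommended setting:

```shell
# Example only: raise the resume-time node wait from the 5-minute default
# to 10 minutes (600000 ms) when launching the Jenkins master.
java -Dorg.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle.timeOutForNodeMillis=600000 \
     -jar jenkins.war
```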

            mkozell Mike Kozell added a comment - - edited

            I would also like the ability to restart a Jenkins master without it restarting or resuming any pipeline builds. Before restarting Jenkins to change a startup parameter, I verified my Jenkins master server was idle with no jobs running on master or slave executors. After the restart I saw the following errors in the log file, and an attempt was made to resume builds 38, 107, and 108, all of which are weeks old. It appears one of these builds was originally hung on "java.lang.InterruptedException" and the other two were hung on "org.jenkinsci.plugins.workflow.steps.FlowInterruptedException". The information used in these builds is obsolete, and I would prefer that a Jenkins master restart not resume any builds. These builds did not resume properly and I had to force-stop them anyway.

            Oct 31, 2017 10:17:51 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
            WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/108:JOBNAME #108]]
            java.util.EmptyStackException
                    at java.util.Stack.peek(Stack.java:102)
                    at java.util.Stack.pop(Stack.java:84)
                    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
                    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    at java.lang.Thread.run(Thread.java:748)
            
            Oct 31, 2017 10:17:55 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
            WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/107:JOBNAME #107]]
            java.util.EmptyStackException
                    at java.util.Stack.peek(Stack.java:102)
                    at java.util.Stack.pop(Stack.java:84)
                    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
                    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    at java.lang.Thread.run(Thread.java:748)
            
            Oct 31, 2017 10:18:02 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
            WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/38:JOBNAME #38]]
            java.util.EmptyStackException
                    at java.util.Stack.peek(Stack.java:102)
                    at java.util.Stack.pop(Stack.java:84)
                    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
                    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
                    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    at java.lang.Thread.run(Thread.java:748)
            
            
            jamesdumay James Dumay made changes -
            Sprint Blue Ocean 1.4 - beta 2 [ 326 ] Blue Ocean 1.4 - beta 2, Pipeline - December [ 326, 446 ]
            jamesdumay James Dumay made changes -
            Rank Ranked lower
            cloudbees CloudBees Inc. made changes -
            Remote Link This issue links to "CloudBees Internal CD-298 (Web Link)" [ 19061 ]
            svanoort Sam Van Oort made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            cloudbees CloudBees Inc. made changes -
            Remote Link This issue links to "CloudBees Internal CD-390 (Web Link)" [ 19741 ]
            svanoort Sam Van Oort added a comment -

            Mike Kozell Have you tried the betas for the pipeline plugins that are currently in the experimental update center?  I'm fairly sure I fixed an error of this category when hardening the work in workflow-cps – this is also the same beta that provides the ability to prevent individual flows from resuming.  With a little work in the script console it should be possible to write a quick script to invoke that on all currently running builds. 
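            The script-console approach Sam alludes to might look something like the sketch below. This is a hypothetical sketch, not a tested script: `FlowExecutionList.get()` is a real iterable of running Pipeline executions (it appears in the stack traces above), but the `setResumeBlocked` setter name is an assumption and should be verified against your workflow-cps version before use.

            ```groovy
            // Script console sketch: mark every currently running Pipeline
            // build as not resumable. API names are assumptions -- verify
            // against your installed workflow-cps plugin before running.
            import org.jenkinsci.plugins.workflow.flow.FlowExecutionList
            import org.jenkinsci.plugins.workflow.cps.CpsFlowExecution

            FlowExecutionList.get().each { exec ->
                if (exec instanceof CpsFlowExecution) {
                    exec.setResumeBlocked(true)   // assumed setter; check before use
                    println "Blocked resume for: ${exec.owner}"
                }
            }
            ```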

            jglick Jesse Glick made changes -
            Link This issue relates to JENKINS-49079 [ JENKINS-49079 ]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Sam Van Oort
            Path:
            pom.xml
            src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowJob.java
            src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
            src/main/java/org/jenkinsci/plugins/workflow/job/properties/DisableResumeJobProperty.java
            src/main/java/org/jenkinsci/plugins/workflow/job/properties/DurabilityHintJobProperty.java
            src/main/resources/org/jenkinsci/plugins/workflow/job/properties/DisableResumeJobProperty/config-details.jelly
            src/main/resources/org/jenkinsci/plugins/workflow/job/properties/DurabilityHintJobProperty/config-details.jelly
            src/main/resources/org/jenkinsci/plugins/workflow/job/properties/DurabilityHintJobProperty/help.html
            src/test/java/org/jenkinsci/plugins/workflow/job/MemoryCleanupTest.java
            src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunRestartTest.java
            src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunTest.java
            src/test/java/org/jenkinsci/plugins/workflow/job/properties/DurabilityHintJobPropertyTest.java
            http://jenkins-ci.org/commit/workflow-job-plugin/5d3b91a68514d74422cc4ec5bc67d99418d7962c
            Log:
            Merge pull request #75 from svanoort/disable-pipeline-resume-JENKINS-33761

            Provide job property for durability hints & add ability to disable pipeline resume JENKINS-33761

            Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/2dfc94ac80bc...5d3b91a68514

            svanoort Sam Van Oort made changes -
            Status In Review [ 10005 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            svanoort Sam Van Oort added a comment -

            Released — for the exact plugin versions, see the Jenkins Pipeline handbook entry on scaling Pipeline.

            svanoort Sam Van Oort made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            gregcovertsmith Greg Smith added a comment -

            For those watching, found direct link Sam mentioned:

            https://jenkins.io/doc/book/pipeline/scaling-pipeline/

            svanoort Sam Van Oort made changes -
            Labels project-cheetah
            mkozell Mike Kozell added a comment - - edited

            Sam Van Oort

            After upgrading Jenkins with the following, I was not able to reproduce the issue after a build timeout, cancelling a build, and restarting Jenkins in the middle of a build.

            Jenkins 2.89.4
            Pipeline 2.5
            Pipeline API 2.26
            Pipeline Nodes and Processes 2.19
            Pipeline Step API 2.14
            Scripts Security 1.41
            durabilityHint=PERFORMANCE_OPTIMIZED
            org.jenkinsci.plugins.workflow.job.properties.DisableResumeJobProperty
            Groovy Sandbox = disabled
            Java = 1.8.0_162

            Although my jobs correctly didn't resume after Jenkins restart, I did see the message below in the build logs.

            Resuming build at Sat Feb 24 06:38:10 UTC 2018 after Jenkins restart
             [Pipeline] End of Pipeline
             java.io.IOException: Cannot resume build – was not cleanly saved when Jenkins shut down.
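            The configuration Mike lists (disable resume plus the PERFORMANCE_OPTIMIZED durability hint) can also be expressed directly in a Declarative Jenkinsfile. A minimal sketch — the `disableResume()` and `durabilityHint` options come from the workflow-job plugin; verify they are available in your installed plugin versions:

            ```groovy
            pipeline {
                agent any
                options {
                    // Do not attempt to resume this build after a controller restart.
                    disableResume()
                    // Reduce durability writes while the build is running.
                    durabilityHint('PERFORMANCE_OPTIMIZED')
                }
                stages {
                    stage('Build') {
                        steps {
                            echo 'Running with resume disabled'
                        }
                    }
                }
            }
            ```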
            jglick Jesse Glick made changes -
            Link This issue is blocked by JENKINS-49961 [ JENKINS-49961 ]
            hellspam Roy Arnon added a comment -

            Hello,

            I am not sure this is related to this issue, but in our pipeline build job we recently added the disableResume step and it does not seem to work correctly:

            Jenkins 2.89.3
            Pipeline 2.5
            Pipeline API 2.27
            Pipeline Nodes and Processes 2.20
            Pipeline Step API 2.16
            Scripts Security 1.44
            durabilityHint=PERFORMANCE_OPTIMIZED
            org.jenkinsci.plugins.workflow.job.properties.DisableResumeJobProperty
            Groovy Sandbox = disabled

             

            Creating placeholder flownodes because failed loading originals.
            Resuming build at Thu Aug 30 12:42:45 UTC 2018 after Jenkins restart
            [Bitbucket] Notifying pull request build result
            [Bitbucket] Build result notified
            [lockable-resources] released lock on [UNIT_TEST_RESOURCE_3]
            java.io.IOException: Tried to load head FlowNodes for execution Owner[Products.Pipeline/PR-5615/7:Products.Pipeline/PR-5615 #7] but FlowNode was not found in storage for head id:FlowNodeId 1:586
            	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
            	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:875)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:745)
            	at hudson.model.RunMap.retrieve(RunMap.java:225)
            	at hudson.model.RunMap.retrieve(RunMap.java:57)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:500)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:482)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:380)
            	at hudson.model.RunMap.getById(RunMap.java:205)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:1098)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:1109)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
            	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
            	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
            	at jenkins.model.Jenkins.<init>(Jenkins.java:974)
            	at hudson.model.Hudson.<init>(Hudson.java:86)
            	at hudson.model.Hudson.<init>(Hudson.java:82)
            	at hudson.WebAppMain$3.run(WebAppMain.java:233)
            Finished: SUCCESS

            This is an issue for us as the build was marked as SUCCESS in bitbucket, which allowed a user to merge a failing test into our release branch.

            The job was definitely running with resume disabled, as this was printed at start of job:

            Resume disabled by user, switching to high-performance, low-durability mode.

            Any ideas? 

            rg Russell Gallop added a comment -

            We have seen the same thing. Resume definitely disabled and still causing hangs.

            rg Russell Gallop made changes -
            Attachment JENKINS-33671_thread_dump.txt [ 44546 ]
            rg Russell Gallop added a comment -

            Thread dump from a node where this is happening attached.

            Jenkins 2.107.3
            Pipeline 2.5
            Pipeline API 2.27
            Pipeline Nodes and Processes 2.19
            Pipeline Step API 2.15

            JENKINS-33671_thread_dump.txt

            Oddly, having killed the job from "Build Executor Status", the node is freed up but the job still seems to think it is running:

            [Pipeline] {
            Creating placeholder flownodes because failed loading originals.
            Resuming build at Thu Sep 20 11:47:46 BST 2018 after Jenkins restart
            [Pipeline] End of Pipeline
            Finished: FAILURE
            <spinning indicator>

            The next thing this job would have done is retry { checkout git ... }

            medianick Nick Jones added a comment - - edited

            I've just experienced the "Creating placeholder flownodes because failed loading originals." error with this stack trace on a Jenkins system running workflow-job 2.25 and workflow-cps 2.64:

            java.io.IOException: Tried to load head FlowNodes for execution Owner[Redacted/dev/3:Redacted/dev #3] but FlowNode was not found in storage for head id:FlowNodeId 1:59
            	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
            	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:875)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:745)
            	at hudson.model.RunMap.retrieve(RunMap.java:225)
            	at hudson.model.RunMap.retrieve(RunMap.java:57)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
            	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
            	at hudson.model.RunMap.getById(RunMap.java:205)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:1112)
            	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:1123)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
            	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
            	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
            	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
            	at jenkins.model.Jenkins.<init>(Jenkins.java:989)
            	at hudson.model.Hudson.<init>(Hudson.java:85)
            	at hudson.model.Hudson.<init>(Hudson.java:81)
            	at hudson.WebAppMain$3.run(WebAppMain.java:233)
            Finished: FAILURE
            

            Restarting the job manually appears to have resolved it, but is there additional information I can provide to troubleshoot what might have caused this? Or is this a different issue than what's discussed here?

            Edit: I should add that workflow-cps was upgraded from 2.63 to 2.64 between the last successful job and the one that failed with the stack trace above. Workflow-job was not changed.


              People

              • Assignee:
                svanoort Sam Van Oort
                Reporter:
                jtilander Jim Tilander
              • Votes:
                47
                Watchers:
                51

                Dates

                • Created:
                  Updated:
                  Resolved: