Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-27371

Parent builds sometimes hang on successful child builds of same type

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Postponed
    • Component/s: multijob-plugin
    • Labels:
      None
    • Environment:
      jenkins 1.593
      multijob plugin 1.16
    • Similar Issues:

      Description

      With a job configuration like:

      parent_job:
      child_job
      child_job

      occasionally we'll get output that looks like this:

      Starting build job child_job.
      Starting build job child_job.
      Finished Build : #123 - Job : child_job with status : SUCCESS
      <this hangs forever, so we abort the parent_job>
      Aborting all subjobs.
      Finished Build : #124 - Job : child_job with status : ABORTED

      However, job #124 will always have finished successfully. We see this relatively rarely (~5% of parent_job builds?). My guess is that there's a race condition causing this, since we usually see this during times of high load (ie many child_job instances are being started).

        Attachments

          Issue Links

            Activity

            Hide
            mcantin Mathieu Cantin added a comment -

            You have the option "Kill the phase on:" for each sub-jobs on a phase. When this is on "Failure", this kill all others builds when this sub-job has failed. Select "never" to never kill others sub-jobs.

            The option "Abort all other job" is when the job is aborted manually by an user.

            Show
            mcantin Mathieu Cantin added a comment - You have the option "Kill the phase on:" for each sub-jobs on a phase. When this is on "Failure", this kill all others builds when this sub-job has failed. Select "never" to never kill others sub-jobs. The option "Abort all other job" is when the job is aborted manually by an user.
            Hide
            simonmweber Simon Weber added a comment -

            I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).

            Show
            simonmweber Simon Weber added a comment - I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).
            Hide
            mcantin Mathieu Cantin added a comment -

            Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job ?

            Show
            mcantin Mathieu Cantin added a comment - Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job ?
            Hide
            simonmweber Simon Weber added a comment -

            > Sorry, I see the problem

            Oh, no worries; I could have done a better job describing the problem.

            > Do you have the "Execute concurrent builds if necessary" checked on child_job?

            Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.

            Show
            simonmweber Simon Weber added a comment - > Sorry, I see the problem Oh, no worries; I could have done a better job describing the problem. > Do you have the "Execute concurrent builds if necessary" checked on child_job? Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.
            Hide
            simonmweber Simon Weber added a comment -

            The logs aren't too informative. My comments are in [ square brackets ].

            parent_job log

            Starting build job child_job.
            Starting build job child_job.
            Finished Build : #6631 of Job : child_job with status : SUCCESS
            Build timed out (after 20 minutes). Marking the build as aborted.  [ this is from a plugin we use to timeout automatically ]
            Aborting all subjobs.
            Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ]
            Build was aborted
            Finished: ABORTED
            

            Jenkins log:

            Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute
            child_job #6631 main build action completed: SUCCESS
            Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute
            child_job #6630 main build action completed: SUCCESS  [ this subjob did finish successfully and on time, but the parent job missed its result ]
            
            <snip>
            
            Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don't think this is relevant - it happens later than the job finishes - but I figured I'd include it ]
            Executor threw an exception
            java.util.NoSuchElementException
                at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
                at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
                at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
                at hudson.util.RunList.subList(RunList.java:137)
                at hudson.tasks.LogRotator.perform(LogRotator.java:124)
                at hudson.model.Job.logRotate(Job.java:449)
                at hudson.model.Run.execute(Run.java:1823)
                at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
                at hudson.model.ResourceController.execute(ResourceController.java:89)
                at hudson.model.Executor.run(Executor.java:240)
            
            
            Show
            simonmweber Simon Weber added a comment - The logs aren't too informative. My comments are in [ square brackets ]. parent_job log Starting build job child_job. Starting build job child_job. Finished Build : #6631 of Job : child_job with status : SUCCESS Build timed out (after 20 minutes). Marking the build as aborted. [ this is from a plugin we use to timeout automatically ] Aborting all subjobs. Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ] Build was aborted Finished: ABORTED Jenkins log: Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute child_job #6631 main build action completed: SUCCESS Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute child_job #6630 main build action completed: SUCCESS [ this subjob did finish successfully and on time, but the parent job missed its result ] <snip> Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don 't think this is relevant - it happens later than the job finishes - but I figured I' d include it ] Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
            Hide
            mcantin Mathieu Cantin added a comment -

            Good new (or bad). This seen a change on the API of Jenkins 1.596.1 (we have updated last friday). I work on a fix.

            Show
            mcantin Mathieu Cantin added a comment - Good new (or bad). This seen a change on the API of Jenkins 1.596.1 (we have updated last friday). I work on a fix.
            Hide
            simonmweber Simon Weber added a comment -

            Is the api change you're thinking of in that specific version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.

            Show
            simonmweber Simon Weber added a comment - Is the api change you're thinking of in that specific version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.
            Hide
            danielbeck Daniel Beck added a comment -

            If something hangs, thread dumps would be helpful.

            https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump

            Show
            danielbeck Daniel Beck added a comment - If something hangs, thread dumps would be helpful. https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump
            Hide
            simonmweber Simon Weber added a comment -

            I finally got a thread dump! It looks like the problem is at this line: https://github.com/jenkinsci/tikal-multijob-plugin/blob/jenkins-multijob-plugin-1.16/src/main/java/com/tikal/jenkins/plugins/multijob/MultiJobBuilder.java#L216.

            Here's the relevant thread dump:

            "Executor #13 for qa-jenkins-psuedorunner-016 : executing platform_pr #2036" Id=257890 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
            	at sun.misc.Unsafe.park(Native Method)
            	-  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
            	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
            	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:216)
            	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110)
            	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$DontRun.conditionalRun(BuildStepRunner.java:264)
            	at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105)
            	at org.jenkinsci.plugins.conditionalbuildstep.singlestep.SingleConditionalBuilder.perform(SingleConditionalBuilder.java:108)
            	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
            	at hudson.model.Build$BuildExecution.build(Build.java:199)
            	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:134)
            	at hudson.model.Run.execute(Run.java:1759)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
            	at hudson.model.ResourceController.execute(ResourceController.java:89)
            	at hudson.model.Executor.run(Executor.java:240)
            

            Let me know if you need any other information from the dump – I saved it all just in case.

            Show
            simonmweber Simon Weber added a comment - I finally got a thread dump! It looks like the problem is at this line: https://github.com/jenkinsci/tikal-multijob-plugin/blob/jenkins-multijob-plugin-1.16/src/main/java/com/tikal/jenkins/plugins/multijob/MultiJobBuilder.java#L216 . Here's the relevant thread dump: "Executor #13 for qa-jenkins-psuedorunner-016 : executing platform_pr #2036" Id=257890 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:216) at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110) at org.jenkins_ci.plugins.run_condition.BuildStepRunner$DontRun.conditionalRun(BuildStepRunner.java:264) at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105) at org.jenkinsci.plugins.conditionalbuildstep.singlestep.SingleConditionalBuilder.perform(SingleConditionalBuilder.java:108) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770) at hudson.model.Build$BuildExecution.build(Build.java:199) at hudson.model.Build$BuildExecution.doRun(Build.java:160) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533) at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:134) at hudson.model.Run.execute(Run.java:1759) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240) Let me know if you need any other information from the dump – I saved it all just in case.
            Hide
            simonmweber Simon Weber added a comment -

            Here's a log message regarding the subjob that finished, but was still being waited on. My comments in [ square brackets ]:

            Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute
            platform_parallel #2620 main build action completed: SUCCESS  [ this was the job the platform_pr was waiting on ]
            
            Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute
            platform_parallel #2619 main build action completed: SUCCESS
            
            Apr 01, 2015 5:52:20 PM WARNING jenkins.model.lazy.LazyBuildMixIn removeRun
            com.tikal.jenkins.plugins.multijob.MultiJobProject@4ff2acc8[platform_parallel] did not contain platform_parallel #1620 to begin with  [ coincidence? was this supposed to be #2620? ]
            
            Apr 01, 2015 5:52:20 PM SEVERE hudson.model.Executor run
            Executor threw an exception
            java.util.NoSuchElementException
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
            	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
            	at hudson.util.RunList.subList(RunList.java:137)
            	at hudson.tasks.LogRotator.perform(LogRotator.java:124)
            	at hudson.model.Job.logRotate(Job.java:449)
            	at hudson.model.Run.execute(Run.java:1823)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
            	at hudson.model.ResourceController.execute(ResourceController.java:89)
            	at hudson.model.Executor.run(Executor.java:240)
            
            Show
            simonmweber Simon Weber added a comment - Here's a log message regarding the subjob that finished, but was still being waited on. My comments in [ square brackets ]: Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2620 main build action completed: SUCCESS [ this was the job the platform_pr was waiting on ] Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2619 main build action completed: SUCCESS Apr 01, 2015 5:52:20 PM WARNING jenkins.model.lazy.LazyBuildMixIn removeRun com.tikal.jenkins.plugins.multijob.MultiJobProject@4ff2acc8[platform_parallel] did not contain platform_parallel #1620 to begin with [ coincidence? was this supposed to be #2620? ] Apr 01, 2015 5:52:20 PM SEVERE hudson.model.Executor run Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
            Hide
            mcantin Mathieu Cantin added a comment -

            I created this pull request, https://github.com/jenkinsci/tikal-multijob-plugin/pull/65, and I think this will fix that.

            Show
            mcantin Mathieu Cantin added a comment - I created this pull request, https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 , and I think this will fix that.
            Hide
            simonmweber Simon Weber added a comment -

            Can we get some eyes on that pull request? We're still running into this issue daily.

            Show
            simonmweber Simon Weber added a comment - Can we get some eyes on that pull request? We're still running into this issue daily.
            Hide
            simonmweber Simon Weber added a comment -

            @mcantin, I just installed your changes from source and unfortunately we're still seeing the problem. I've only seen hangs when one of the child builds failed, in case that's useful.

            Show
            simonmweber Simon Weber added a comment - @mcantin, I just installed your changes from source and unfortunately we're still seeing the problem. I've only seen hangs when one of the child builds failed, in case that's useful.
            Hide
            mjmj Michael Merritt added a comment - - edited

            We're also seeing this same issue that Simon is describing and it's causing deploy headaches for us. Frequently having to kill the job and rebuild it and roll the dice to see if it's going to work. Our logs look similar to those mentioned above. Let me know if you need more info.

            Jenkins 1.607
            Multijob 1.16

            website-deploy-production-download #10432 main build action completed: SUCCESS
            Jun 10, 2015 8:50:19 PM SEVERE hudson.model.Executor finish1
            Executor threw an exception
            java.util.NoSuchElementException
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
            	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
            	at hudson.util.RunList.subList(RunList.java:137)
            	at hudson.tasks.LogRotator.perform(LogRotator.java:124)
            	at hudson.model.Job.logRotate(Job.java:465)
            	at hudson.model.Run.execute(Run.java:1805)
            	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
            	at hudson.model.ResourceController.execute(ResourceController.java:98)
            	at hudson.model.Executor.run(Executor.java:374)
            
            Show
            mjmj Michael Merritt added a comment - - edited We're also seeing this same issue that Simon is describing and it's causing deploy headaches for us. Frequently having to kill the job and rebuild it and roll the dice to see if it's going to work. Our logs look similar to those mentioned above. Let me know if you need more info. Jenkins 1.607 Multijob 1.16 website-deploy-production-download #10432 main build action completed: SUCCESS Jun 10, 2015 8:50:19 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:465) at hudson.model.Run.execute(Run.java:1805) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)
            Hide
            mcantin Mathieu Cantin added a comment -

            Since ~ 1 month, I try a new fix and I don't have again this bug. Can you try it? The PR : https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 and the build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/159/.

            Show
            mcantin Mathieu Cantin added a comment - Since ~ 1 month, I try a new fix and I don't have again this bug. Can you try it? The PR : https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 and the build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/159/ .
            Hide
            mjmj Michael Merritt added a comment -

            Thanks, we'll give it a try.

            Show
            mjmj Michael Merritt added a comment - Thanks, we'll give it a try.
            Hide
            simonmweber Simon Weber added a comment -

            Mathieu Cantin, that fix didn't work for us: it made all builds hang.

            I found this error in the logs:

            Jun 19, 2015 9:34:03 PM WARNING hudson.triggers.Trigger checkTriggers
            org.jenkinsci.plugins.ghprb.GhprbTrigger.run() failed for hudson.model.FreeStyleProject@b347b18[venmo_platform_auto_pr]
            java.lang.NullPointerException
            	at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:155)
            	at hudson.triggers.Trigger.checkTriggers(Trigger.java:266)
            	at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214)
            	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
            	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
            	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            	at java.lang.Thread.run(Thread.java:724)
            
            Show
            simonmweber Simon Weber added a comment - Mathieu Cantin , that fix didn't work for us: it made all builds hang. I found this error in the logs: Jun 19, 2015 9:34:03 PM WARNING hudson.triggers.Trigger checkTriggers org.jenkinsci.plugins.ghprb.GhprbTrigger.run() failed for hudson.model.FreeStyleProject@b347b18[venmo_platform_auto_pr] java.lang.NullPointerException at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:155) at hudson.triggers.Trigger.checkTriggers(Trigger.java:266) at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang. Thread .run( Thread .java:724)
            Hide
            mcantin Mathieu Cantin added a comment - - edited

            Simon Weber
            I'm sorry. Did you update the "GitHub pull request builder plugin" (https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin) recently? Because, the error was on this plugin with a "Free Style Project". Which version do you have?

            Show
            mcantin Mathieu Cantin added a comment - - edited Simon Weber I'm sorry. Did you update the "GitHub pull request builder plugin" ( https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin ) recently? Because, the error was on this plugin with a "Free Style Project". Which version do you have?
            Hide
            simonmweber Simon Weber added a comment -

            Whoops! I pasted the wrong thing. Let me do some debugging and get back to you.

            Show
            simonmweber Simon Weber added a comment - Whoops! I pasted the wrong thing. Let me do some debugging and get back to you.
            Hide
            simonmweber Simon Weber added a comment -

            Ok, I had to fix a few unrelated problems, haha. The hanging I was seeing before was caused by Jenkins taking a long time to discard old builds. I'm not sure why that was triggered by updating the plugin and restarting – maybe I just got unlucky?

            Anyway, after fixing all that other stuff and upgrading, things look good! I'll know for sure next week once we have some real load on it, but this looks promising =)

            Show
            simonmweber Simon Weber added a comment - Ok, I had to fix a few unrelated problems, haha. The hanging I was seeing before was caused by Jenkins taking a long time to discard old builds. I'm not sure why that was triggered by updating the plugin and restarting – maybe I just got unlucky? Anyway, after fixing all that other stuff and upgrading, things look good! I'll know for sure next week once we have some real load on it, but this looks promising =)
            Hide
            simonmweber Simon Weber added a comment -

            Mathieu Cantin bummer; we just had a build hang.

            Here are the logs, my comments <"like this"> so jira highlights them.

            From the build itself:

            Started by upstream project "platform_deploy_listener" build number 219
            <"snip">
            [MultiJob] Starting job platform_parallel.
            [MultiJob] Starting job platform_parallel.
            [MultiJob] Starting job venmo_platform_external_integration.
            [MultiJob] Starting job venmo_platform_external_integration.
            [MultiJob] Starting job scope_frontend.
            [MultiJob] Finished Build : #96 - deploy of Job : scope_frontend with status : SUCCESS
            [MultiJob] Finished Build : #6837 of Job : platform_parallel with status : SUCCESS
            [MultiJob] Finished Build : #6836 of Job : platform_parallel with status : SUCCESS
            [MultiJob] Finished Build : #1248 - deploy of Job : venmo_platform_external_integration with status : SUCCESS
            Build timed out (after 20 minutes). Marking the build as aborted.
            [MultiJob] Aborting all subjobs.
            [MultiJob] Aborting platform_parallel.
            [MultiJob] Aborting platform_parallel.
            [MultiJob] Aborting venmo_platform_external_integration.
            [MultiJob] Aborting venmo_platform_external_integration.
            [MultiJob] Aborting scope_frontend.
            Build was aborted
            <"snip">
            [MultiJob] Finished Build : #1247 - deploy of Job : venmo_platform_external_integration with status : ABORTED <"here's the build that hung">
            

            From Jenkins:

            Jun 22, 2015 2:58:55 PM INFO
            venmo_platform_test_and_deploy - #863 Started by ...
            Jun 22, 2015 3:09:10 PM INFO hudson.model.Run execute
            venmo_platform_external_integration #1247 main build action completed: SUCCESS <"here we see it completed successfully">
            Jun 22, 2015 3:09:11 PM INFO hudson.model.Run execute
            venmo_platform_external_integration #1248 main build action completed: SUCCESS
            Jun 22, 2015 3:09:20 PM SEVERE hudson.model.Executor finish1
            Executor threw an exception
            java.util.NoSuchElementException <"perhaps this is related? We keep the most recent 1000 builds.">
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
            	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
            	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
            	at hudson.util.RunList.subList(RunList.java:139)
            	at hudson.tasks.LogRotator.perform(LogRotator.java:125)
            	at hudson.model.Job.logRotate(Job.java:467)
            	at hudson.model.Run.execute(Run.java:1808)
            	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
            	at hudson.model.ResourceController.execute(ResourceController.java:98)
            	at hudson.model.Executor.run(Executor.java:374)
            
            <"these look to be expected">
            Jun 22, 2015 3:18:48 PM INFO hudson.model.Run execute
            platform_deploy_listener #219 aborted
            java.lang.InterruptedException
            	at java.lang.Object.wait(Native Method)
            	at java.lang.Object.wait(Object.java:503)
            	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
            	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135)
            	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
            	at hudson.model.Build$BuildExecution.build(Build.java:205)
            	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
            	at hudson.model.Run.execute(Run.java:1744)
            	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
            	at hudson.model.ResourceController.execute(ResourceController.java:98)
            	at hudson.model.Executor.run(Executor.java:374)
            
            
            Jun 22, 2015 3:18:57 PM INFO hudson.model.Run execute
            venmo_platform_test_and_deploy #863 aborted
            java.lang.InterruptedException
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
            	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:220)
            	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
            	at hudson.model.Build$BuildExecution.build(Build.java:205)
            	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:137)
            	at hudson.model.Run.execute(Run.java:1744)
            	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76)
            	at hudson.model.ResourceController.execute(ResourceController.java:98)
            	at hudson.model.Executor.run(Executor.java:374)
            
            Show
            simonmweber Simon Weber added a comment - Mathieu Cantin bummer; we just had a build hang. Here are the logs, my comments <"like this"> so jira highlights them. From the build itself: Started by upstream project "platform_deploy_listener" build number 219 < "snip" > [MultiJob] Starting job platform_parallel. [MultiJob] Starting job platform_parallel. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job scope_frontend. [MultiJob] Finished Build : #96 - deploy of Job : scope_frontend with status : SUCCESS [MultiJob] Finished Build : #6837 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #6836 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #1248 - deploy of Job : venmo_platform_external_integration with status : SUCCESS Build timed out (after 20 minutes). Marking the build as aborted. [MultiJob] Aborting all subjobs. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting scope_frontend. Build was aborted < "snip" > [MultiJob] Finished Build : #1247 - deploy of Job : venmo_platform_external_integration with status : ABORTED < "here's the build that hung" > From Jenkins: Jun 22, 2015 2:58:55 PM INFO venmo_platform_test_and_deploy - #863 Started by ... Jun 22, 2015 3:09:10 PM INFO hudson.model.Run execute venmo_platform_external_integration #1247 main build action completed: SUCCESS < "here we see it completed successfully" > Jun 22, 2015 3:09:11 PM INFO hudson.model.Run execute venmo_platform_external_integration #1248 main build action completed: SUCCESS Jun 22, 2015 3:09:20 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException < "perhaps this is related? We keep the most recent 1000 builds." > at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:139) at hudson.tasks.LogRotator.perform(LogRotator.java:125) at hudson.model.Job.logRotate(Job.java:467) at hudson.model.Run.execute(Run.java:1808) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) < "these look to be expected" > Jun 22, 2015 3:18:48 PM INFO hudson.model.Run execute platform_deploy_listener #219 aborted java.lang.InterruptedException at java.lang. Object .wait(Native Method) at java.lang. Object .wait( Object .java:503) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73) at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at hudson.model.Run.execute(Run.java:1744) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) Jun 22, 2015 3:18:57 PM INFO hudson.model.Run execute venmo_platform_test_and_deploy #863 aborted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:220) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:137) at hudson.model.Run.execute(Run.java:1744) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)
            Hide
            mcantin Mathieu Cantin added a comment -

            It's difficult to debug. I added debug information in this build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/166/ (lines started with [MultiJob][Debug]). This could help to debug.

            Show
            mcantin Mathieu Cantin added a comment - It's difficult to debug. I added debug information in this build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/166/ (lines started with [MultiJob] [Debug] ). This could help to debug.
            Hide
            simonmweber Simon Weber added a comment -

            Sounds good. I'll update to that revision and hopefully have an example in the next few days.

            Show
            simonmweber Simon Weber added a comment - Sounds good. I'll update to that revision and hopefully have an example in the next few days.
            Hide
            simonmweber Simon Weber added a comment -

            So I've got my first example, but it's actually not for a multijob build! In this case, the parent build used https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin to trigger and block on the downstream build (which itself was a multijob build). The parent build was timed out after 20 minutes while the child was successful.

            So, perhaps the problem is actually with that plugin instead? My understanding is that it's a dependency of the multijob plugin. My installed version was 2.25 (I've since upgraded to 2.27 in case that fixes it).

            Here are the logs, but they're not too interesting:

            parent build:

            Waiting for the completion of platform_parallel
            Build timed out (after 20 minutes). Marking the build as aborted.
            Build was aborted
            

            child (multijob) build:

            [MultiJob] Starting job venmo_platform.
            [MultiJob] Starting job venmo_platform.
            [MultiJob] Starting job venmo_platform.
            [MultiJob][Debug] Create executor service.
            [MultiJob][Debug] Create task venmo_platform
            [MultiJob][Debug] Create task venmo_platform
            [MultiJob][Debug] Create task venmo_platform
            [MultiJob][Debug] Shutdown.
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] Timeoutexception, continue
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] future is started
            [MultiJob][Debug] future is started
            [MultiJob][Debug] future is started
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] future is started
            [MultiJob][Debug] Thy to start future for venmo_platform
            [MultiJob][Debug] future is started
            <"snip; this repeats for a long time">
            [MultiJob][Debug] Result value is SUCCESS
            [MultiJob] Finished Build : #29045 -  of Job : venmo_platform with status : SUCCESS
            [MultiJob][Debug]  Finish execution of SUCCESS
            [MultiJob][Debug] Check phase termination.
            <"one of these for each subbuild">
            [MultiJob][Debug] Continuation calculation.
            Finished: SUCCESS
            
            Show
            simonmweber Simon Weber added a comment - So I've got my first example, but it's actually not for a multijob build! In this case, the parent build used https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin to trigger and block on the downstream build (which itself was a multijob build). The parent build was timed out after 20 minutes while the child was successful. So, perhaps the problem is actually with that plugin instead? My understanding is that it's a dependency of the multijob plugin. My installed version was 2.25 (I've since upgraded to 2.27 in case that fixes it). Here are the logs, but they're not too interesting: parent build: Waiting for the completion of platform_parallel Build timed out (after 20 minutes). Marking the build as aborted. Build was aborted child (multijob) build: [MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob][Debug] Create executor service. [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Shutdown. [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started < "snip; this repeats for a long time" > [MultiJob][Debug] Result value is SUCCESS [MultiJob] Finished Build : #29045 - of Job : venmo_platform with status : SUCCESS [MultiJob][Debug] Finish execution of SUCCESS [MultiJob][Debug] Check phase termination. < "one of these for each subbuild" > [MultiJob][Debug] Continuation calculation. Finished: SUCCESS
            Hide
            simonmweber Simon Weber added a comment -

            I just triggered an identical hang with the upgraded plugin, so it doesn't look like it's fixed.

            Show
            simonmweber Simon Weber added a comment - I just triggered an identical hang with the upgraded plugin, so it doesn't look like it's fixed.
            Show
            elgalu Leo Gallucci added a comment - I can recreate when using https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin
            Hide
            mcantin Mathieu Cantin added a comment -

            Which version of Jenkins are you running?

            Show
            mcantin Mathieu Cantin added a comment - Which version of Jenkins are you running?
            Hide
            elgalu Leo Gallucci added a comment -

            Jenkins ver. 1.609.1

            Show
            elgalu Leo Gallucci added a comment - Jenkins ver. 1.609.1
            Hide
            simonmweber Simon Weber added a comment -

            We haven't seen this issue in a few months. We're now running Jenkins 1.628, trigger plugin 2.27, and multijob from source at https://github.com/jenkinsci/tikal-multijob-plugin/pull/65.

            Show
            simonmweber Simon Weber added a comment - We haven't seen this issue in a few months. We're now running Jenkins 1.628, trigger plugin 2.27, and multijob from source at https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 .
            Hide
            tbridges Tony Bridges added a comment -

            We've seen this a few times, most recently in a MultiJob run calling a maven job, but previously in a Freestyle job calling maven jobs.
            We are running 1.609.3.1, CBE.
            I've opened a ticket with CB Support and cited this defect.

            Show
            tbridges Tony Bridges added a comment - We've seen this a few times, most recently in a MultiJob run calling a maven job, but previously in a Freestyle job calling maven jobs. We are running 1.609.3.1, CBE. I've opened a ticket with CB Support and cited this defect.
            Hide
            tbridges Tony Bridges added a comment -

            Does the pull request conclusively correct the issue ? If so, maybe we can get it rolled into a new plugin release ?

            Show
            tbridges Tony Bridges added a comment - Does the pull request conclusively correct the issue ? If so, maybe we can get it rolled into a new plugin release ?
            Hide
            simonmweber Simon Weber added a comment -

            Tony Bridges I don't think this was solved for us until I upgraded Jenkins. I'm not sure what combination of the jenkins, trigger plugin, and multijob plugin upgrades is necessary to solve it, but the environment I described earlier has been rock solid for months now.

            Show
            simonmweber Simon Weber added a comment - Tony Bridges I don't think this was solved for us until I upgraded Jenkins. I'm not sure what combination of the jenkins, trigger plugin, and multijob plugin upgrades is necessary to solve it, but the environment I described earlier has been rock solid for months now.
            Hide
            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            How to reproduce it:

            • Configure a FreeStyle Project - freeStyle01
            • Configure a Maven Project - maven01
            • Configure a Multijob Project with 30 subjobs (enough to have time to stop slave agent) - Multijob01
            • Configure one slave agent with 1 executor - slave01
            • Run Multijob01
            • Go to Manage Jenkins/Manage Nodes and put slave01 offline or kill "java -jar slave.jar" on slave agent
            • Multijob01 hang waiting for finish task
            Show
            ifernandezcalvo Ivan Fernandez Calvo added a comment - How to reproduce it: Configure a FreeStyle Project - freeStyle01 Configure a Maven Project - maven01 Configure a Multijob Project with 30 subjobs (enough to have time to stop slave agent) - Multijob01 Configure one slave agent with 1 executor - slave01 Run Multijob01 Go to Manage Jenkins/Manage Nodes and put slave01 offline or kill "java -jar slave.jar" on slave agent Multijob01 hang waiting for finish task
            Hide
            ataylor Alex Taylor added a comment -

            So I would think this could happen even with a slight disconnect of the slave per Ivan's last comment.

            Is there any more information that can be provided here?

            Show
            ataylor Alex Taylor added a comment - So I would think this could happen even with a slight disconnect of the slave per Ivan's last comment. Is there any more information that can be provided here?
            Hide
            yorammi Yoram Michaeli added a comment -

            Closing issue as part of tikal-multijob-plugin issues cleanup.
            If still relevant, please open a matching issue in https://github.com/jenkinsci/tikal-multijob-plugin/issues (you can refer to this issue in its description)

            Show
            yorammi Yoram Michaeli added a comment - Closing issue as part of tikal-multijob-plugin issues cleanup. If still relevant, please open a matching issue in https://github.com/jenkinsci/tikal-multijob-plugin/issues (you can refer to this issue in its description)

              People

              • Assignee:
                chenc Chen Cohen
                Reporter:
                simonmweber Simon Weber
              • Votes:
                3 Vote for this issue
                Watchers:
                12 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: