Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-27708

Concurrent build limits not honored on Jenkins 1.607

    XMLWordPrintable

    Details

    • Similar Issues:

      Description

      After upgrading to 1.607 we noticed the throttle plugin doesn't always prevent jobs from running in parallel as expected.

      Steps to reproduce:
      1. Create a throttle category with global limit 1, per-node limit 1.
      2. Create 3 jobs using the category, restricted to a node with two executors, sleep 60 as a build step.
      3. Request builds of all 3 jobs.

      What should happen:
      4. Jobs run in sequence; 1 then 2 then 3.

      What actually happens:
      5. Job 1 starts building, jobs 2 and 3 wait in queue (OK).
      6. After job 1 finishes, both job 2 and 3 start running (not OK).

      Plugin version is 1.8.4. Issue does not appear in Jenkins 1.606 with the same version of the plugin. Issue is reproducible on a fresh install.

      We were forced to downgrade back to 1.606 as a workaround (which is unfortunately not trivial due to JENKINS-27700).

        Attachments

          Issue Links

            Activity

            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            https://github.com/jenkinsci/jenkins/pull/1596 reworks thread synchronization approaches in the core. It may cause such issue.

            Show
            oleg_nenashev Oleg Nenashev added a comment - https://github.com/jenkinsci/jenkins/pull/1596 reworks thread synchronization approaches in the core. It may cause such issue.
            Hide
            stephenconnolly Stephen Connolly added a comment -

            Actually I suspect that https://github.com/jenkinsci/jenkins/pull/1596 just exposes problems with throttle concurrent builds plugin that were masked.

            Throttle concurrent builds "works" by searching for jobs that are running during the QueueTaskDispatcher.canTake evaluation. In Jenkins core, pre: https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/Queue.java#L1178 and post: https://github.com/jenkinsci/jenkins/blob/jenkins-607/core/src/main/java/hudson/model/Queue.java#L1352 the QueueTaskDispatcher methods are called for all Buildable jobs in the Queue before mapping any of them against an executor.

            Thus https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 will always fail to count jobs that are being picked off the queue at the same time.

            What the TCB plugin should probably be doing is either:

            The code change in 1.607 just removed some of the threading issues that allowed the TCB plugin to work most of the time by accident this bug represents a logical flaw in TCB that could still occur in 1.536-1.606 (most likely if running on a single core master) if the thread sequencing falls just right as Queue.maintain evaluates all projects in Buildable in one chunk and then maps them against executors.

            With 1.536-1.606 the executor threads could pick up the tasks while the Queue was in the process of dividing out the atomic set of work assignments, and thus - as long as the executor threads were scheduled fast enough - the TCB plugin would see the work assignments in the process of being assigned (but on a heavily loaded system or a single core system this issue would still occur)

            With 1.607 we ensure that the work assignment from Queue.maintain is mapped as an atomic unit, and thus the bug in TCB is exposed always.

            At least that is my analysis

            Show
            stephenconnolly Stephen Connolly added a comment - Actually I suspect that https://github.com/jenkinsci/jenkins/pull/1596 just exposes problems with throttle concurrent builds plugin that were masked. Throttle concurrent builds "works" by searching for jobs that are running during the QueueTaskDispatcher.canTake evaluation. In Jenkins core, pre: https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/Queue.java#L1178 and post: https://github.com/jenkinsci/jenkins/blob/jenkins-607/core/src/main/java/hudson/model/Queue.java#L1352 the QueueTaskDispatcher methods are called for all Buildable jobs in the Queue before mapping any of them against an executor. Thus https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 will always fail to count jobs that are being picked off the queue at the same time. What the TCB plugin should probably be doing is either: using ResourceLists https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/ResourceController.java#L80 to prevent the build from actually starting or checking against both projects on or when counting builds on a node: https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 it should actually be including also the Buildable projects against that computer https://github.com/jenkinsci/jenkins/blob/jenkins-1.607/core/src/main/java/hudson/model/Queue.java#L846 The code change in 1.607 just removed some of the threading issues that allowed the TCB plugin to work most of the time by accident this bug represents a logical flaw in TCB that could still occur in 1.536-1.606 (most likely if running on a single core master) if the thread sequencing falls just right as Queue.maintain evaluates all projects in Buildable in one chunk and then maps them against executors. With 1.536-1.606 the executor threads could pick up the tasks while the Queue was in the process of dividing out the atomic set of work assignments, and thus - as long as the executor threads were scheduled fast enough - the TCB plugin would see the work assignments in the process of being assigned (but on a heavily loaded system or a single core system this issue would still occur) With 1.607 we ensure that the work assignment from Queue.maintain is mapped as an atomic unit, and thus the bug in TCB is exposed always. At least that is my analysis
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Stephen, thanks for the analysis.
            I confirm the issue and hope to work on it soon if somebody does not take the issue. It seems to be a showstopper for jenkins-1.607+.

            Show
            oleg_nenashev Oleg Nenashev added a comment - Stephen, thanks for the analysis. I confirm the issue and hope to work on it soon if somebody does not take the issue. It seems to be a showstopper for jenkins-1.607+.
            Hide
            zowers Alexander Petrov added a comment -

            > It seems to be a showstopper for jenkins-1.607+.
            the issue's priority should be changed to blocker then

            Show
            zowers Alexander Petrov added a comment - > It seems to be a showstopper for jenkins-1.607+. the issue's priority should be changed to blocker then
            Hide
            danielbeck Daniel Beck added a comment -

            Could be worse. This is not breaking your Jenkins install (as in crashes).

            Show
            danielbeck Daniel Beck added a comment - Could be worse. This is not breaking your Jenkins install (as in crashes).
            Hide
            stephenconnolly Stephen Connolly added a comment -

            Can you see if the "hack" I have made in http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150410.091817-1.war resolves this issue for you?

            diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
            index d776a57..6a5e006 100644
            --- a/core/src/main/java/hudson/model/Queue.java
            +++ b/core/src/main/java/hudson/model/Queue.java
            @@ -1335,7 +1335,10 @@ public class Queue extends ResourceController implements Saveable {
                         final QueueSorter s = sorter;
                         if (s != null)
                             s.sortBuildableItems(buildables);
            -
            +            
            +            // JENKINS-27708, JENKINS-27871, etc
            +            updateSnapshot();
            +            
                         // allocate buildable jobs to executors
                         for (BuildableItem p : new ArrayList<BuildableItem>(
                                 buildables)) {// copy as we'll mutate the list in the loop
            @@ -1372,6 +1375,9 @@ public class Queue extends ResourceController implements Saveable {
                                 makePending(p);
                             else
                                 LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?", p);
            +                                
            +                // JENKINS-27708, JENKINS-27871, etc
            +                updateSnapshot();
                         }
                     } finally { updateSnapshot(); } } finally {
                         lock.unlock();
            
            Show
            stephenconnolly Stephen Connolly added a comment - Can you see if the "hack" I have made in http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150410.091817-1.war resolves this issue for you? diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..6a5e006 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -1335,7 +1335,10 @@ public class Queue extends ResourceController implements Saveable { final QueueSorter s = sorter; if (s != null ) s.sortBuildableItems(buildables); - + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); + // allocate buildable jobs to executors for (BuildableItem p : new ArrayList<BuildableItem>( buildables)) { // copy as we'll mutate the list in the loop @@ -1372,6 +1375,9 @@ public class Queue extends ResourceController implements Saveable { makePending(p); else LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?" , p); + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); } } finally { updateSnapshot(); } } finally { lock.unlock();
            Hide
            tsniatowski Tomasz Śniatowski added a comment -

            It seems to help. When I run the steps from the bug description on the jenkins.war you posted I get correct behavior, the three builds run sequentially.

            Show
            tsniatowski Tomasz Śniatowski added a comment - It seems to help. When I run the steps from the bug description on the jenkins.war you posted I get correct behavior, the three builds run sequentially.
            Hide
            stephenconnolly Stephen Connolly added a comment -

            http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150411.073620-2.war is an alternative attempt. It would be good if you could confirm that it also resolves the issue.

            diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
            index d776a57..3c9367a 100644
            --- a/core/src/main/java/hudson/model/Queue.java
            +++ b/core/src/main/java/hudson/model/Queue.java
            @@ -844,6 +844,17 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}.
                  */
                 public List<BuildableItem> getBuildableItems(Computer c) {
            +        if (lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
            +            try {
            +                List<BuildableItem> result = new ArrayList<BuildableItem>();
            +                _getBuildableItems(c, buildables, result);
            +                _getBuildableItems(c, pendings, result);
            +                return result;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     List<BuildableItem> result = new ArrayList<BuildableItem>();
                     _getBuildableItems(c, snapshot.buildables, result);
            @@ -865,6 +876,16 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets the snapshot of all {@link BuildableItem}s.
                  */
                 public List<BuildableItem> getBuildableItems() {
            +        if (lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
            +            try {
            +                ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables);
            +                r.addAll(pendings);
            +                return r;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables);
                     r.addAll(snapshot.pendings);
            @@ -875,6 +896,14 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets the snapshot of all {@link BuildableItem}s.
                  */
                 public List<BuildableItem> getPendingItems() {
            +        if (lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
            +            try {
            +                return new ArrayList<BuildableItem>(pendings);
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     return new ArrayList<BuildableItem>(snapshot.pendings);
                 }
             
            @@ -902,6 +931,19 @@ public class Queue extends ResourceController implements Saveable {
                  * @since 1.402
                  */
                 public List<Item> getUnblockedItems() {
            +        if (lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
            +            try {
            +                List<Item> queuedNotBlocked = new ArrayList<Item>();
            +                queuedNotBlocked.addAll(waitingList);
            +                queuedNotBlocked.addAll(buildables);
            +                queuedNotBlocked.addAll(pendings);
            +                // but not 'blockedProjects'
            +                return queuedNotBlocked;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     List<Item> queuedNotBlocked = new ArrayList<Item>();
                     queuedNotBlocked.addAll(snapshot.waitingList);
            @@ -928,6 +970,16 @@ public class Queue extends ResourceController implements Saveable {
                  * Is the given task currently pending execution?
                  */
                 public boolean isPending(Task t) {
            +        if (lock.tryLock()) {
            +            try {
            +                for (BuildableItem i : pendings)
            +                    if (i.task.equals(t))
            +                        return true;
            +                return false;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     for (BuildableItem i : snapshot.pendings)
                         if (i.task.equals(t))
            

            If both work then I will fire each up on my test bed environment and see which passes the concurrency and lock stress... then we commit the simpler fix.

            Show
            stephenconnolly Stephen Connolly added a comment - http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150411.073620-2.war is an alternative attempt. It would be good if you could confirm that it also resolves the issue. diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..3c9367a 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -844,6 +844,17 @@ public class Queue extends ResourceController implements Saveable { * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}. */ public List<BuildableItem> getBuildableItems(Computer c) { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<BuildableItem> result = new ArrayList<BuildableItem>(); + _getBuildableItems(c, buildables, result); + _getBuildableItems(c, pendings, result); + return result; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<BuildableItem> result = new ArrayList<BuildableItem>(); _getBuildableItems(c, snapshot.buildables, result); @@ -865,6 +876,16 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getBuildableItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables); + r.addAll(pendings); + return r; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables); r.addAll(snapshot.pendings); @@ -875,6 +896,14 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getPendingItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + return new ArrayList<BuildableItem>(pendings); + } finally { + lock.unlock(); + } + } return new ArrayList<BuildableItem>(snapshot.pendings); } @@ -902,6 +931,19 @@ public class Queue extends ResourceController implements Saveable { * @since 1.402 */ public List<Item> getUnblockedItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<Item> queuedNotBlocked = new ArrayList<Item>(); + queuedNotBlocked.addAll(waitingList); + queuedNotBlocked.addAll(buildables); + queuedNotBlocked.addAll(pendings); + // but not 'blockedProjects' + return queuedNotBlocked; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<Item> queuedNotBlocked = new ArrayList<Item>(); queuedNotBlocked.addAll(snapshot.waitingList); @@ -928,6 +970,16 @@ public class Queue extends ResourceController implements Saveable { * Is the given task currently pending execution? */ public boolean isPending(Task t) { + if (lock.tryLock()) { + try { + for (BuildableItem i : pendings) + if (i.task.equals(t)) + return true ; + return false ; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; for (BuildableItem i : snapshot.pendings) if (i.task.equals(t)) If both work then I will fire each up on my test bed environment and see which passes the concurrency and lock stress... then we commit the simpler fix.
            Hide
            tsniatowski Tomasz Śniatowski added a comment -

            Both appear to work for me.

            Show
            tsniatowski Tomasz Śniatowski added a comment - Both appear to work for me.
            Hide
            stephenconnolly Stephen Connolly added a comment -

            Here is one final version of a fix, presented for posterity - or if we need to revisit.

            diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
            index d776a57..67ba711 100644
            --- a/core/src/main/java/hudson/model/Queue.java
            +++ b/core/src/main/java/hudson/model/Queue.java
            @@ -340,6 +340,25 @@ public class Queue extends ResourceController implements Saveable {
             
                 private transient final Condition condition = lock.newCondition();
             
            +    /**
            +     * There are some plugins that determine whether a project is blocked from execution on the basis of the 
            +     * jobs currently executing or currently pending execution. When we ask if a {@link Queue.Task} is blocked
            +     * for execution from {@link #maintain()}, we need to ensure that the {@link Queue.Task#isBuildBlocked(Item)}
            +     * is returning a result based on the live information of jobs scheduled rather than a result based on the
            +     * most recent {@link #snapshot}. This field holds the identity of the current thread that is running 
            +     * {@link #maintain()} and thus we can switch to returning live results when called from that thread.
            +     * 
            +     * Two alternative fixes for this class of issue (JENKINS-22708 and JENKINS-27871) could also be used:
            +     * <ul>
            +     *     <li>Update the {@link #snapshot}</li> after each and every {@link Queue.Task} is assigned for execution.</li>
            +     *     <li>Have {@link #maintain()} return after finding one {@link Queue.Task} to execute.</li>
            +     * </ul>
            +     * @since 1.FIXME
            +     * @see #liveResultRequired() 
            +     */
            +    @CheckForNull
            +    private transient volatile Thread maintainThread = null;
            +
                 public Queue(@Nonnull LoadBalancer loadBalancer) {
                     this.loadBalancer =  loadBalancer.sanitize();
                     // if all the executors are busy doing something, then the queue won't be maintained in
            @@ -844,6 +863,19 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}.
                  */
                 public List<BuildableItem> getBuildableItems(Computer c) {
            +        if (liveResultRequired() && lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
            +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
            +            // of the current thread being the thread running maintain().
            +            try {
            +                List<BuildableItem> result = new ArrayList<BuildableItem>();
            +                _getBuildableItems(c, buildables, result);
            +                _getBuildableItems(c, pendings, result);
            +                return result;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     List<BuildableItem> result = new ArrayList<BuildableItem>();
                     _getBuildableItems(c, snapshot.buildables, result);
            @@ -865,6 +897,18 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets the snapshot of all {@link BuildableItem}s.
                  */
                 public List<BuildableItem> getBuildableItems() {
            +        if (liveResultRequired() && lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
            +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
            +            // of the current thread being the thread running maintain().
            +            try {
            +                ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables);
            +                r.addAll(pendings);
            +                return r;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables);
                     r.addAll(snapshot.pendings);
            @@ -875,6 +919,16 @@ public class Queue extends ResourceController implements Saveable {
                  * Gets the snapshot of all {@link BuildableItem}s.
                  */
                 public List<BuildableItem> getPendingItems() {
            +        if (liveResultRequired() && lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
            +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
            +            // of the current thread being the thread running maintain().
            +            try {
            +                return new ArrayList<BuildableItem>(pendings);
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     return new ArrayList<BuildableItem>(snapshot.pendings);
                 }
             
            @@ -902,6 +956,21 @@ public class Queue extends ResourceController implements Saveable {
                  * @since 1.402
                  */
                 public List<Item> getUnblockedItems() {
            +        if (liveResultRequired() && lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
            +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
            +            // of the current thread being the thread running maintain().
            +            try {
            +                List<Item> queuedNotBlocked = new ArrayList<Item>();
            +                queuedNotBlocked.addAll(waitingList);
            +                queuedNotBlocked.addAll(buildables);
            +                queuedNotBlocked.addAll(pendings);
            +                // but not 'blockedProjects'
            +                return queuedNotBlocked;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     List<Item> queuedNotBlocked = new ArrayList<Item>();
                     queuedNotBlocked.addAll(snapshot.waitingList);
            @@ -928,6 +997,19 @@ public class Queue extends ResourceController implements Saveable {
                  * Is the given task currently pending execution?
                  */
                 public boolean isPending(Task t) {
            +        if (liveResultRequired() && lock.tryLock()) {
            +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
            +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
            +            // of the current thread being the thread running maintain().
            +            try {
            +                for (BuildableItem i : pendings)
            +                    if (i.task.equals(t))
            +                        return true;
            +                return false;
            +            } finally {
            +                lock.unlock();
            +            }
            +        }
                     Snapshot snapshot = this.snapshot;
                     for (BuildableItem i : snapshot.pendings)
                         if (i.task.equals(t))
            @@ -1250,6 +1332,18 @@ public class Queue extends ResourceController implements Saveable {
                 }
             
                 /**
            +     * Some functionality that determines whether builds are blocked depending on what builds are currently running
            +     * needs live information rather than information from the most recently updated snapshot. (See
            +     * JENKINS-22708 and JENKINS-27871)
            +     * @return {@code true} if the current thread is running {@link #maintain()} and thus live results should be
            +     * returned rather than results from {@link #snapshot}
            +     * @since 1.FIXME
            +     */
            +    private boolean liveResultRequired() {
            +        return Thread.currentThread() == maintainThread;
            +    }
            +    
            +    /**
                  * Queue maintenance.
                  *
                  * <p>
            @@ -1264,6 +1358,7 @@ public class Queue extends ResourceController implements Saveable {
                 public void maintain() {
                     lock.lock();
                     try { try {
            +            maintainThread = Thread.currentThread();
             
                         LOGGER.log(Level.FINE, "Queue maintenance started {0}", this);
             
            @@ -1373,7 +1468,7 @@ public class Queue extends ResourceController implements Saveable {
                             else
                                 LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?", p);
                         }
            -        } finally { updateSnapshot(); } } finally {
            +        } finally { maintainThread = null; updateSnapshot(); } } finally {
                         lock.unlock();
                     }
                 }
            

            I am going to apply Occam's razor and apply a slightly tweaked version of the first fix to core. Jobs are not scheduled for execution at a particularly onerous rate such that the creation of a new snapshot should not cause undue GC pressure, plus the snapshot itself should be easy for GC to clean up and should not really ever end up being promoted out of the eden pool, so simplest fix works

            Show
            stephenconnolly Stephen Connolly added a comment - Here is one final version of a fix, presented for posterity - or if we need to revisit. diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..67ba711 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -340,6 +340,25 @@ public class Queue extends ResourceController implements Saveable { private transient final Condition condition = lock.newCondition(); + /** + * There are some plugins that determine whether a project is blocked from execution on the basis of the + * jobs currently executing or currently pending execution. When we ask if a {@link Queue.Task} is blocked + * for execution from {@link #maintain()}, we need to ensure that the {@link Queue.Task#isBuildBlocked(Item)} + * is returning a result based on the live information of jobs scheduled rather than a result based on the + * most recent {@link #snapshot}. This field holds the identity of the current thread that is running + * {@link #maintain()} and thus we can switch to returning live results when called from that thread. + * + * Two alternative fixes for this class of issue (JENKINS-22708 and JENKINS-27871) could also be used: + * <ul> + * <li>Update the {@link #snapshot}</li> after each and every {@link Queue.Task} is assigned for execution.</li> + * <li>Have {@link #maintain()} return after finding one {@link Queue.Task} to execute.</li> + * </ul> + * @since 1.FIXME + * @see #liveResultRequired() + */ + @CheckForNull + private transient volatile Thread maintainThread = null ; + public Queue(@Nonnull LoadBalancer loadBalancer) { this .loadBalancer = loadBalancer.sanitize(); // if all the executors are busy doing something, then the queue won't be maintained in @@ -844,6 +863,19 @@ public class Queue extends ResourceController implements Saveable { * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}. */ public List<BuildableItem> getBuildableItems(Computer c) { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + List<BuildableItem> result = new ArrayList<BuildableItem>(); + _getBuildableItems(c, buildables, result); + _getBuildableItems(c, pendings, result); + return result; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<BuildableItem> result = new ArrayList<BuildableItem>(); _getBuildableItems(c, snapshot.buildables, result); @@ -865,6 +897,18 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getBuildableItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables); + r.addAll(pendings); + return r; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables); r.addAll(snapshot.pendings); @@ -875,6 +919,16 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getPendingItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + return new ArrayList<BuildableItem>(pendings); + } finally { + lock.unlock(); + } + } return new ArrayList<BuildableItem>(snapshot.pendings); } @@ -902,6 +956,21 @@ public class Queue extends ResourceController implements Saveable { * @since 1.402 */ public List<Item> getUnblockedItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + List<Item> queuedNotBlocked = new ArrayList<Item>(); + queuedNotBlocked.addAll(waitingList); + queuedNotBlocked.addAll(buildables); + queuedNotBlocked.addAll(pendings); + // but not 'blockedProjects' + return queuedNotBlocked; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<Item> queuedNotBlocked = new ArrayList<Item>(); queuedNotBlocked.addAll(snapshot.waitingList); @@ -928,6 +997,19 @@ public class Queue extends ResourceController implements Saveable { * Is the given task currently pending execution? */ public boolean isPending(Task t) { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + for (BuildableItem i : pendings) + if (i.task.equals(t)) + return true ; + return false ; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; for (BuildableItem i : snapshot.pendings) if (i.task.equals(t)) @@ -1250,6 +1332,18 @@ public class Queue extends ResourceController implements Saveable { } /** + * Some functionality that determines whether builds are blocked depending on what builds are currently running + * needs live information rather than information from the most recently updated snapshot. (See + * JENKINS-22708 and JENKINS-27871) + * @ return {@code true } if the current thread is running {@link #maintain()} and thus live results should be + * returned rather than results from {@link #snapshot} + * @since 1.FIXME + */ + private boolean liveResultRequired() { + return Thread .currentThread() == maintainThread; + } + + /** * Queue maintenance. * * <p> @@ -1264,6 +1358,7 @@ public class Queue extends ResourceController implements Saveable { public void maintain() { lock.lock(); try { try { + maintainThread = Thread .currentThread(); LOGGER.log(Level.FINE, "Queue maintenance started {0}" , this ); @@ -1373,7 +1468,7 @@ public class Queue extends ResourceController implements Saveable { else LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?" , p); } - } finally { updateSnapshot(); } } finally { + } finally { maintainThread = null ; updateSnapshot(); } } finally { lock.unlock(); } } I am going to apply Occam's razor and apply a slightly tweaked version of the first fix to core. Jobs are not scheduled for execution at a particularly onerous rate such that the creation of a new snapshot should not cause undue GC pressure, plus the snapshot itself should be easy for GC to clean up and should not really ever end up being promoted out of the eden pool, so simplest fix works
            Hide
            stephenconnolly Stephen Connolly added a comment -

            It appears we have missed the boat for 1.609...

            Show
            stephenconnolly Stephen Connolly added a comment - It appears we have missed the boat for 1.609...
            Hide
            stephenconnolly Stephen Connolly added a comment -

            https://github.com/jenkinsci/jenkins/pull/1645 and this should hopefully land for 1.610

            Show
            stephenconnolly Stephen Connolly added a comment - https://github.com/jenkinsci/jenkins/pull/1645 and this should hopefully land for 1.610
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Reassigned the issue to Stephen

            Show
            oleg_nenashev Oleg Nenashev added a comment - Reassigned the issue to Stephen
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            http://jenkins-ci.org/commit/jenkins/5880ed830201f9349ae9def6653c19a186e1eb18
            Log:
            [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.

            • The creation of a snapshot itself should be relatively cheap given the expected rate of
              job execution. You probably would need 100's of jobs starting execution every iteration
              of maintain() before this could even start to become an issue and likely the calculation
              of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
              since the snapshot itself only ever has at most one reference originating outside of the stack
              it should remain in the eden space and thus be cheap to GC.
            • JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
              this issue. I am favouring this approach as it is simpler and provides less scope for error as any
              new helper methods can just rely on the snapshot being up to date whereas with the other
              two candidates if a new helper method is introduced there is the potential to miss adding support
              for the live view. The comment 225819 has the risk of introducing extra lock contention while
              the comment 225906 version forces every access to the helper methods to pass a second memory
              barrier
            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java http://jenkins-ci.org/commit/jenkins/5880ed830201f9349ae9def6653c19a186e1eb18 Log: [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. The creation of a snapshot itself should be relatively cheap given the expected rate of job execution. You probably would need 100's of jobs starting execution every iteration of maintain() before this could even start to become an issue and likely the calculation of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally since the snapshot itself only ever has at most one reference originating outside of the stack it should remain in the eden space and thus be cheap to GC. JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of this issue. I am favouring this approach as it is simpler and provides less scope for error as any new helper methods can just rely on the snapshot being up to date whereas with the other two candidates if a new helper method is introduced there is the potential to miss adding support for the live view. The comment 225819 has the risk of introducing extra lock contention while the comment 225906 version forces every access to the helper methods to pass a second memory barrier
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            test/src/test/java/hudson/model/QueueTest.java
            http://jenkins-ci.org/commit/jenkins/152d00ad09931c10f02fab4ac8a42e574d622bd3
            Log:
            Merge pull request #1645 from stephenc/jenkins-27708

            JENKINS-27708, JENKINS-27871 Ensure that identification of blocked tasks is using the live state.

            Compare: https://github.com/jenkinsci/jenkins/compare/b688047877cf...152d00ad0993

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java test/src/test/java/hudson/model/QueueTest.java http://jenkins-ci.org/commit/jenkins/152d00ad09931c10f02fab4ac8a42e574d622bd3 Log: Merge pull request #1645 from stephenc/jenkins-27708 JENKINS-27708 , JENKINS-27871 Ensure that identification of blocked tasks is using the live state. Compare: https://github.com/jenkinsci/jenkins/compare/b688047877cf...152d00ad0993
            Hide
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4087

            Result = SUCCESS

            Show
            dogfood dogfood added a comment - Integrated in jenkins_main_trunk #4087 Result = SUCCESS
            Hide
            alanbirtles Alan Birtles added a comment -

            Did this make it into 1.610?

            Show
            alanbirtles Alan Birtles added a comment - Did this make it into 1.610?
            Show
            danielbeck Daniel Beck added a comment - Yes, see: https://github.com/jenkinsci/jenkins/commit/152d00ad09931c10f02fab4ac8a42e574d622bd3 Unfortunately, https://github.com/jenkinsci/jenkins/pull/1645#issuecomment-94141001
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Marking as fixed

            Show
            oleg_nenashev Oleg Nenashev added a comment - Marking as fixed
            Hide
            mjdetullio Matthew DeTullio added a comment -

            I think with this latest approach there's some issues when both the Build Blocker Plugin and Throttle Concurrent Builds Plugin are used in conjunction. I've had two instances now where builds get stuck in the queue when nothing that is currently executing should be blocking them.

            Hovering over the queue items, they do not offer a reason for the blockage – only a start reason and waiting time in queue.

            Trying to find a consistent method to reproduce and scope of the issue, but regardless our use of the Build Blocker Plugin in this instance is legacy and due to be removed.

            Show
            mjdetullio Matthew DeTullio added a comment - I think with this latest approach there's some issues when both the Build Blocker Plugin and Throttle Concurrent Builds Plugin are used in conjunction. I've had two instances now where builds get stuck in the queue when nothing that is currently executing should be blocking them. Hovering over the queue items, they do not offer a reason for the blockage – only a start reason and waiting time in queue. Trying to find a consistent method to reproduce and scope of the issue, but regardless our use of the Build Blocker Plugin in this instance is legacy and due to be removed.
            Hide
            costescuandrei Andrei Costescu added a comment -

            It seems to be fixed for me as well (I reported duplicate JENKINS-28084).

            But now the same (as in above comment) happened to me multiple times: jobs get stuck from time to time in queue. Nothing else running, but they just won't start. If I cancel ('x' button) them from queue and start them again manually they do start.

            I don't have a Build Blocker Plugin installed.

            Can I check something more when I see this happening again to pinpoint the cause?
            Should we reopen or create separate bug report?

            Show
            costescuandrei Andrei Costescu added a comment - It seems to be fixed for me as well (I reported duplicate JENKINS-28084 ). But now the same (as in above comment) happened to me multiple times: jobs get stuck from time to time in queue. Nothing else running, but they just won't start. If I cancel ('x' button) them from queue and start them again manually they do start. I don't have a Build Blocker Plugin installed. Can I check something more when I see this happening again to pinpoint the cause? Should we reopen or create separate bug report?
            Hide
            dvh_yxlon Dirk von Husen added a comment -

            Seems like we have similar issues with only BuildBlocker (not the throttle concurrent builds plugin) installed. But we´re still running on 1610. Just looks like queued blocked builds are not triggert (or the regex check) again after one blocking build finished.

            Show
            dvh_yxlon Dirk von Husen added a comment - Seems like we have similar issues with only BuildBlocker (not the throttle concurrent builds plugin) installed. But we´re still running on 1610. Just looks like queued blocked builds are not triggert (or the regex check) again after one blocking build finished.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            http://jenkins-ci.org/commit/jenkins/11b89d9c0b46f0afad0ca86a11264b503b799c2e
            Log:
            [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.

            • The creation of a snapshot itself should be relatively cheap given the expected rate of
              job execution. You probably would need 100's of jobs starting execution every iteration
              of maintain() before this could even start to become an issue and likely the calculation
              of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
              since the snapshot itself only ever has at most one reference originating outside of the stack
              it should remain in the eden space and thus be cheap to GC.
            • JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
              this issue. I am favouring this approach as it is simpler and provides less scope for error as any
              new helper methods can just rely on the snapshot being up to date whereas with the other
              two candidates if a new helper method is introduced there is the potential to miss adding support
              for the live view. The comment 225819 has the risk of introducing extra lock contention while
              the comment 225906 version forces every access to the helper methods to pass a second memory
              barrier

            (cherry picked from commit 5880ed830201f9349ae9def6653c19a186e1eb18)

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java http://jenkins-ci.org/commit/jenkins/11b89d9c0b46f0afad0ca86a11264b503b799c2e Log: [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. The creation of a snapshot itself should be relatively cheap given the expected rate of job execution. You probably would need 100's of jobs starting execution every iteration of maintain() before this could even start to become an issue and likely the calculation of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally since the snapshot itself only ever has at most one reference originating outside of the stack it should remain in the eden space and thus be cheap to GC. JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of this issue. I am favouring this approach as it is simpler and provides less scope for error as any new helper methods can just rely on the snapshot being up to date whereas with the other two candidates if a new helper method is introduced there is the potential to miss adding support for the live view. The comment 225819 has the risk of introducing extra lock contention while the comment 225906 version forces every access to the helper methods to pass a second memory barrier (cherry picked from commit 5880ed830201f9349ae9def6653c19a186e1eb18)
            Hide
            costescuandrei Andrei Costescu added a comment -

            Dirk, did you find a solution to the new problem - builds not starting when they should?
            I didn't update Jenkins recently to avoid hitting new bugs.

            Show
            costescuandrei Andrei Costescu added a comment - Dirk, did you find a solution to the new problem - builds not starting when they should? I didn't update Jenkins recently to avoid hitting new bugs.
            Hide
            dvh_yxlon Dirk von Husen added a comment - - edited

            No, did not find a solution so far. I am going to create an issue here soon. Since we were still using an older version, I wanted to update first, check if the problem is still there and then come back to this issue.
            Unfortunately the problem still exists, even after the update to version 1.613.

            -> please notice JENKINS-28376

            Show
            dvh_yxlon Dirk von Husen added a comment - - edited No, did not find a solution so far. I am going to create an issue here soon. Since we were still using an older version, I wanted to update first, check if the problem is still there and then come back to this issue. Unfortunately the problem still exists, even after the update to version 1.613. -> please notice JENKINS-28376
            Hide
            costescuandrei Andrei Costescu added a comment -

            I added my vote to it. Thanks.

            Show
            costescuandrei Andrei Costescu added a comment - I added my vote to it. Thanks.
            Hide
            stephenconnolly Stephen Connolly added a comment -

            This is fundamentally an issue in t-c-b-p making incorrect assumptions

            Show
            stephenconnolly Stephen Connolly added a comment - This is fundamentally an issue in t-c-b-p making incorrect assumptions
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Andrei Costescu , Stephen Connolly, Dirk von Husen
            Is it a subject for a new issue?

            Show
            oleg_nenashev Oleg Nenashev added a comment - Andrei Costescu , Stephen Connolly , Dirk von Husen Is it a subject for a new issue?
            Hide
            costescuandrei Andrei Costescu added a comment -

            Dirk already created JENKINS-28376 for the new problem.
            Don't know how related it is in code to the current one.

            Show
            costescuandrei Andrei Costescu added a comment - Dirk already created JENKINS-28376 for the new problem. Don't know how related it is in code to the current one.
            Hide
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4292
            [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. (Revision 11b89d9c0b46f0afad0ca86a11264b503b799c2e)

            Result = UNSTABLE
            ogondza : 11b89d9c0b46f0afad0ca86a11264b503b799c2e
            Files :

            • core/src/main/java/hudson/model/Queue.java
            Show
            dogfood dogfood added a comment - Integrated in jenkins_main_trunk #4292 [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. (Revision 11b89d9c0b46f0afad0ca86a11264b503b799c2e) Result = UNSTABLE ogondza : 11b89d9c0b46f0afad0ca86a11264b503b799c2e Files : core/src/main/java/hudson/model/Queue.java

              People

              • Assignee:
                oleg_nenashev Oleg Nenashev
                Reporter:
                tsniatowski Tomasz Śniatowski
              • Votes:
                2 Vote for this issue
                Watchers:
                17 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: