  Jenkins / JENKINS-28926

Jenkins queue self-locking without apparent reason?


      Description

      Since some weeks ago we have been experiencing some problems with the Jenkins queue.

      While looking for dupes before creating this... I've found a bunch of similar issues, but I'm not sure whether any of them is the very same issue as this one, because they often talk about various plugins we are not using at all. Here is a brief list of those "similar" issues, just in case they all turn out, in the end, to be the same problem: JENKINS-28532, JENKINS-28887, JENKINS-28136, JENKINS-28376, JENKINS-28690...

      One thing they all have in common is that they are really recent, and it seems that, whatever the problem is, it started around 1.611. While I don't have the exact version for our case (because we update continuously), I'd say it started happening recently here too.

      Description:

      We have 2 Jenkins servers, a public one (Linux) and a private/testing one (Mac), and we are experiencing the same problem on both. This is the URL of the public one:

      http://integration.moodle.org

      There we have some "chains" of freestyle jobs, with all the jobs having both the "Block build when upstream project is building" and "Block build when downstream project is building" settings ticked.

      The first job is always a git-pull-changes one and it starts the "chain" whenever changes are detected in the target branch. We have one chain for every supported branch.
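      To make that concrete, one link of such a chain looks roughly like this when created through the core API (a script-console style sketch with made-up job names; we actually configure everything through the job UI, so take the exact calls as illustrative):

          import hudson.model.FreeStyleProject;
          import hudson.tasks.BuildTrigger;
          import jenkins.model.Jenkins;

          Jenkins jenkins = Jenkins.getInstance();

          // first job of the chain: pulls changes and triggers the next one
          FreeStyleProject pull = jenkins.createProject(FreeStyleProject.class, "master-git-pull");
          FreeStyleProject tests = jenkins.createProject(FreeStyleProject.class, "master-tests");
          pull.getPublishersList().add(new BuildTrigger("master-tests", false));

          // both core "Block build when ... project is building" options ticked on every job
          pull.setBlockBuildWhenUpstreamBuilding(true);
          pull.setBlockBuildWhenDownstreamBuilding(true);
          tests.setBlockBuildWhenUpstreamBuilding(true);
          tests.setBlockBuildWhenDownstreamBuilding(true);

          jenkins.rebuildDependencyGraph();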

      And this has been working for ages (years). If for any reason a job was manually launched, or the scheduled (every 5 minutes) git job detected new changes... it has never been a problem. Those new jobs were there, in the queue, waiting for the current "chain" to finish. And, once it finished, the queue handling was clever enough to pick the first job to execute from it, also deleting dupes or whatever was needed.

      Basically, the summary is that it never became stuck, no matter how many new jobs were in the queue or how they had landed in it (manually or automatically). So far, perfect.

      But, since some versions ago... that has changed drastically. Now, if we manually add jobs to the queue, or if multiple changes are detected in a short period of time... those jobs in the queue correctly wait for the current "chain" to end (like they used to do; this can be seen by hovering over the elements). But once the chain has ended, the queue is not able to decide which job to start with, and it becomes "locked" forever.

      Right now, if you go to the server above... you'll see that there are 4 jobs, all of them belonging to the "master" view/branch/chain, waiting in the queue... never launched and, worse, preventing new runs in that branch from happening. And the hover information does not show any waiting cause (screenshots added, showing both manually added jobs while the chain was running and automatic jobs, none of them showing a reason for the locking, even though all the executors are idle).

      And those self-locks are really having an impact here, because they are transforming our "continuous automatic integration" experience into a "wow, we have not run tests for master in 2 days, wtf, let's kill the queue manually and process all changes together, grrr" thing. I'm sure you get it, lol.

      Those servers and chains have been working perfectly since time immemorial and, while we are using various plugins for notifications, conditional builds and so on, it seems that the way the queue handles jobs using the core "Block build..." settings has changed recently, easily leading (with both manual & automated changes) to some horrible locks.

      Constantly. And it's a recent change of behavior. I'm not sure if it's OK to call it a "bug" (although I feel inclined to think so), but I can assure you that it's hurting our integration experience here.

      Finally, we are reproducing this behavior with both 1.617 (testing server) and older 1.613 (public server).

      Ciao and thanks for all the hard work, you rock

        Attachments

          Issue Links

            Activity

            stronk7 Eloy Lafuente created issue -
            stronk7 Eloy Lafuente made changes -
            Field Original Value New Value
            Attachment self_locked_automatic.png [ 29963 ]
            Attachment self_locked_manual.png [ 29964 ]
            stronk7 Eloy Lafuente made changes -
            Description (edited)
            stronk7 Eloy Lafuente made changes -
            Description (edited)
            danielbeck Daniel Beck added a comment -

            There we have some "chains" of free-form jobs, with all the jobs having both the "Block build when upstream project is building" and "Block build when downstream project is building" settings ticked.

            The jobs in the queue block each other.

            stronk7 Eloy Lafuente added a comment -

            The jobs in the queue block each other.

            1) There is no information in the "hover" text about those blockings at all (see screenshots).
            2) Until recently, that never happened. The queue was clever enough to pick the first (in the chain) job and start executing it. That's the point. It has become "silly", lol.

            Ciao

            danielbeck Daniel Beck added a comment -

            Uh… yeah that wasn't supposed to be the only text in the comment. Sorry about that.

            So yes, it's a problem. The feature was (IMO) pretty bad to begin with, now it's a joke. Will probably need to exclude queued items from being considered, so something will be able to escape the deadlock.
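
            To illustrate the deadlock being described (a conceptual sketch with hypothetical names only, not the real hudson.model.Queue code): if queued-but-blocked items count as "busy" for the upstream/downstream check, two blocked items can keep each other blocked even though nothing is building.

                import java.util.*;

                class BlockingSketch {
                    // A project counts as "busy" if it is building or waiting in the queue.
                    static boolean isBusy(String project, Set<String> building, Set<String> queued) {
                        return building.contains(project) || queued.contains(project);
                    }

                    public static void main(String[] args) {
                        Set<String> building = new HashSet<>();                      // nothing is running
                        Set<String> queued = new HashSet<>(Arrays.asList("A", "B")); // both only queued

                        // A stays blocked because B looks busy, and B stays blocked because A looks busy.
                        System.out.println("A blocked: " + isBusy("B", building, queued)); // true
                        System.out.println("B blocked: " + isBusy("A", building, queued)); // true
                        // Excluding blocked queue items from the "busy" check would let one of them start.
                    }
                }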

            danielbeck Daniel Beck added a comment -

            Assigning Stephen Connolly, asking for feedback. WDYT? Looks like this was introduced by the queue fixes.

            danielbeck Daniel Beck made changes -
            Assignee stephenconnolly [ stephenconnolly ]
            stronk7 Eloy Lafuente added a comment -

            (no worries, thanks for the instant feedback!)

            stephenconnolly Stephen Connolly made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            stephenconnolly Stephen Connolly made changes -
            Labels lts-candidate queue regression
            stephenconnolly Stephen Connolly added a comment -

            https://github.com/jenkinsci/jenkins/pull/1743

            stephenconnolly Stephen Connolly added a comment -

            towards 1.618

            stephenconnolly Stephen Connolly made changes -
            Status In Progress [ 3 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4187
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete (Revision de87736795898e57f7aca140124c2b1a3d1daf40)
            JENKINS-28926 Adding test case (Revision c44c088442e1821f8cd44f4fdaa146d94dd85910)

            Result = UNSTABLE
            stephen connolly : de87736795898e57f7aca140124c2b1a3d1daf40
            Files :

            • core/src/main/java/hudson/model/Queue.java
            • core/src/main/java/hudson/model/queue/QueueSorter.java

            stephen connolly : c44c088442e1821f8cd44f4fdaa146d94dd85910
            Files :

            • test/src/test/java/hudson/model/QueueTest.java
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4188
            JENKINS-28926 Noting merge of #1743 (Revision a208dfeac886d67d805505546e49ae52940a191e)
            JENKINS-28926 Tidy-up TODO for the Java 7+ Jenkins versions (Revision 8c5b9cd008a4d0fb30dc39d9ee1bd72b95b199f2)

            Result = UNSTABLE
            stephen connolly : a208dfeac886d67d805505546e49ae52940a191e
            Files :

            • core/src/main/java/hudson/model/queue/QueueSorter.java
            • changelog.html

            stephen connolly : 8c5b9cd008a4d0fb30dc39d9ee1bd72b95b199f2
            Files :

            • core/src/main/java/hudson/model/Queue.java
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/queue/QueueSorter.java
            http://jenkins-ci.org/commit/jenkins/de87736795898e57f7aca140124c2b1a3d1daf40
            Log:
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

            • One could argue that without this change the system is functioning correctly and that previous behaviour
              was a bug. On the other hand, people have come to rely on the previous behaviour.
            • The issue really centers around state changes in the blocked tasks. Since blocking on upstream/downstream
              relies on checking the building projects and the queued (excluding blocked) tasks we need any change in
              the blocked task list to be visible immediately (i.e. update the snapshot)
            • I was able to reliably reproduce this behaviour with a convoluted set of manually configured projects
              but turning this into a test case has not proved quite as easy. Manual testing confirms that the issue is
              fixed for my manual test case
            • I have also added a sorting of the blocked list when probing for tasks to unblock. This should prioritise
              tasks as intended by the QueueSorter
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            test/src/test/java/hudson/model/QueueTest.java
            http://jenkins-ci.org/commit/jenkins/c44c088442e1821f8cd44f4fdaa146d94dd85910
            Log:
            JENKINS-28926 Adding test case

            • I was forgetting the call to `rebuildDependencyGraph()` which was why the test didn't work for me
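
            For context, the shape of such a test with the Jenkins test harness is roughly as follows (a sketch, not the actual test added in the pull request; project names are made up). The `rebuildDependencyGraph()` call is the easy bit to forget:

                import hudson.model.FreeStyleProject;
                import hudson.tasks.BuildTrigger;
                import org.junit.Rule;
                import org.junit.Test;
                import org.jvnet.hudson.test.JenkinsRule;

                public class BlockOnUpstreamDownstreamSketchTest {
                    @Rule public JenkinsRule j = new JenkinsRule();

                    @Test
                    public void chainDrainsInsteadOfDeadlocking() throws Exception {
                        FreeStyleProject up = j.createFreeStyleProject("up");
                        FreeStyleProject down = j.createFreeStyleProject("down");

                        // "up" triggers "down", and both use the core "Block build when..." options
                        up.getPublishersList().add(new BuildTrigger("down", false));
                        up.setBlockBuildWhenDownstreamBuilding(true);
                        down.setBlockBuildWhenUpstreamBuilding(true);

                        // without this, the queue never sees the upstream/downstream relationship
                        j.jenkins.rebuildDependencyGraph();

                        j.buildAndAssertSuccess(up);
                        j.waitUntilNoActivity(); // both builds should finish rather than wedge the queue
                    }
                }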
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/queue/QueueSorter.java
            test/src/test/java/hudson/model/QueueTest.java
            http://jenkins-ci.org/commit/jenkins/7929412037ff75f60791cfb23631521f8726c23d
            Log:
            Merge pull request #1743 from stephenc/jenkins-28926

            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

            Compare: https://github.com/jenkinsci/jenkins/compare/482bffa9cb91...7929412037ff

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            changelog.html
            core/src/main/java/hudson/model/queue/QueueSorter.java
            http://jenkins-ci.org/commit/jenkins/a208dfeac886d67d805505546e49ae52940a191e
            Log:
            JENKINS-28926 Noting merge of #1743

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            http://jenkins-ci.org/commit/jenkins/8c5b9cd008a4d0fb30dc39d9ee1bd72b95b199f2
            Log:
            JENKINS-28926 Tidy-up TODO for the Java 7+ Jenkins versions

            leedega Kevin Phillips added a comment -

            Can anyone here confirm whether this fix will address the problem described here?

            Further, I have confirmed the problem (at least as described under JENKINS-28513) is reproducible on the latest LTS release as well (v1.609.1). We just recently finished rolling out this LTS edition into production on our build farm yesterday and have had numerous cases already where this bug is affecting our production teams.

            As it stands we've had severely detrimental effects on our development teams as a result of this defect so the sooner it can be backported the better!

            stronk7 Eloy Lafuente added a comment -

            Not knowing anything about the internals... if your jobs stayed in the queue forever, never being picked for build... and without any "this is blocked by xxxxx" hover, I'd say the fix here may solve your situation (as far as I understood the discussion @ github, it precisely avoids those deadlocks in the queue "without cause").

            But be warned, I can be 200% wrong; it's just a supposition, based on the "symptom" being the same one I experienced here (never mind that I do not use the build-blocker-plugin, but the core "Block upstream/downstream" settings instead).

            Surely once 1.618 is out we'll easily know the answer. Ciao

            leedega Kevin Phillips added a comment -

            Since we have experienced severe regression problems with every single Jenkins upgrade we have ever performed, we now have a sandbox environment set up for testing new versions (although apparently our test environment is insufficient to catch all problems, since we still managed to miss this one).

            I only mention that here because I can probably test out 1.618 fairly quickly to see if I can reproduce the problem on our particular configuration, which I would be happy to do if it means we can get the fix backported sooner.

            Just let me know if I can help.

            stephenconnolly Stephen Connolly added a comment -

            FYI if you are stuck, killing one of the deadlocked threads (i.e. calling Thread.stop() on the one with Queue.maintain() ) from the Groovy console will repair your instance without restarting it.

            We have a CloudBees hotfix for this issue (sadly for CloudBees customers) that does just that, i.e. periodically checks for this type of deadlock and kills the one with Queue.maintain() as that is the safe one to kill.

            All the test scenarios we could come up with to reproduce these type of deadlocks do not give rise to deadlocks on 1.618 (but do deadlock 1.617)... doesn't mean that Kevin Phillips's deadlock is the same... it may be a different deadlock... providing the stack trace of the deadlocked threads is the easiest way to confirm/deny
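
            For anyone stuck on an affected version, a rough sketch of that workaround as a script-console snippet (Manage Jenkins » Script Console; it looks for a thread whose stack is inside Queue.maintain() and stops it; Thread.stop() is deprecated and unsafe in general, so this is strictly a last resort):

                import java.util.Map;

                // Find the thread that is wedged inside hudson.model.Queue.maintain() and stop it.
                for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                    for (StackTraceElement frame : e.getValue()) {
                        if ("hudson.model.Queue".equals(frame.getClassName())
                                && "maintain".equals(frame.getMethodName())) {
                            System.out.println("Stopping thread: " + e.getKey().getName());
                            e.getKey().stop();
                            break;
                        }
                    }
                }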

            dvh_yxlon Dirk von Husen made changes -
            Link This issue is related to JENKINS-28376 [ JENKINS-28376 ]
            danielbeck Daniel Beck made changes -
            Link This issue is duplicated by JENKINS-29028 [ JENKINS-29028 ]
            olivergondza Oliver Gondža made changes -
            Labels lts-candidate queue regression 1.609.2-fixed queue regression
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/queue/QueueSorter.java
            http://jenkins-ci.org/commit/jenkins/4f4a64a522ec7bf31f24280827757214e6985f3d
            Log:
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

            • One could argue that without this change the system is functioning correctly and that previous behaviour
              was a bug. On the other hand, people have come to rely on the previous behaviour.
            • The issue really centers around state changes in the blocked tasks. Since blocking on upstream/downstream
              relies on checking the building projects and the queued (excluding blocked) tasks we need any change in
              the blocked task list to be visible immediately (i.e. update the snapshot)
            • I was able to reliably reproduce this behaviour with a convoluted set of manually configured projects
              but turning this into a test case has not proved quite as easy. Manual testing confirms that the issue is
              fixed for my manual test case
            • I have also added a sorting of the blocked list when probing for tasks to unblock. This should prioritise
              tasks as intended by the QueueSorter

            (cherry picked from commit de87736795898e57f7aca140124c2b1a3d1daf40)

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            test/src/test/java/hudson/model/QueueTest.java
            http://jenkins-ci.org/commit/jenkins/8596004024e9d8a00a99c459b4d7c82c004d1724
            Log:
            JENKINS-28926 Adding test case

            • I was forgetting the call to `rebuildDependencyGraph()` which was why the test didn't work for me

            (cherry picked from commit c44c088442e1821f8cd44f4fdaa146d94dd85910)

            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #4292
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete (Revision 4f4a64a522ec7bf31f24280827757214e6985f3d)
            JENKINS-28926 Adding test case (Revision 8596004024e9d8a00a99c459b4d7c82c004d1724)

            Result = UNSTABLE
            ogondza : 4f4a64a522ec7bf31f24280827757214e6985f3d
            Files :

            • core/src/main/java/hudson/model/queue/QueueSorter.java
            • core/src/main/java/hudson/model/Queue.java

            ogondza : 8596004024e9d8a00a99c459b4d7c82c004d1724
            Files :

            • test/src/test/java/hudson/model/QueueTest.java
            stephenconnolly Stephen Connolly made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 163790 ] JNJira + In-Review [ 208882 ]

              People

              • Assignee:
                stephenconnolly Stephen Connolly
                Reporter:
                stronk7 Eloy Lafuente
              • Votes:
                0
                Watchers:
                11
