JENKINS-27476

Plugin causes deadlock on Jenkins LTS 1.596.1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Component/s: durable-task-plugin
    • Labels:
      None
    • Environment:
      RHEL 6.x
      Jenkins LTS 1.596.1
      Durable Task Plugin 1.4

      Description

      Our Jenkins instance locks up every day.
      It seems like this is due to the durable task plugin.

      Using JConsole connected to the running Java process, I find deadlocks and get this stack trace:

      Name: Computer.threadPoolForRemoting [#179]
      State: BLOCKED on hudson.slaves.RetentionStrategy$Demand@1455cecd owned by: jenkins.util.Timer [#1]
      Total blocked: 26  Total waited: 522
      
      Stack trace: 
      hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:212)
      hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:172)
      hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:661)
      hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
      hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
         - locked java.lang.Object@68ed76f9
      jenkins.model.Jenkins.updateComputerList(Jenkins.java:1218)
      jenkins.model.Jenkins.setNodes(Jenkins.java:1714)
      jenkins.model.Jenkins.removeNode(Jenkins.java:1709)
         - locked hudson.model.Hudson@794217b7
      hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
      org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy$1.run(OnceRetentionStrategy.java:125)
         - locked hudson.model.Queue@5a25192e
      jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      java.util.concurrent.FutureTask.run(FutureTask.java:166)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:722)
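
The deadlock that JConsole reports here can also be detected from code through the JDK's `ThreadMXBean`. The following is a minimal sketch, not Jenkins code: the class name `DeadlockProbe`, the thread names, and the two plain `Object` monitors are hypothetical stand-ins. It starts two daemon threads that take the same two monitors in opposite order, then asks the JVM which threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {

    /** Returns the names of threads the JVM reports as deadlocked (empty if none). */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited in the meantime
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    /** Polls for a deadlock, sleeping 100 ms between attempts. */
    static List<String> awaitDeadlock(int attempts) {
        List<String> names = deadlockedThreadNames();
        for (int i = 0; i < attempts && names.isEmpty(); i++) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            names = deadlockedThreadNames();
        }
        return names;
    }

    /** Starts two daemon threads that take the same two monitors in opposite order. */
    static void startInvertedLockers(Object first, Object second) {
        CountDownLatch bothHoldOne = new CountDownLatch(2);
        startLocker("demo-locker-A", first, second, bothHoldOne);
        startLocker("demo-locker-B", second, first, bothHoldOne);
    }

    private static void startLocker(String name, Object outer, Object inner, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (outer) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its outer monitor
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (inner) {
                    // not reached: the other thread owns this monitor and is waiting for ours
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) {
        startInvertedLockers(new Object(), new Object());
        System.out.println("Deadlocked threads: " + awaitDeadlock(50));
    }
}
```

A check like this could run periodically in a monitoring thread, although for a one-off diagnosis a thread dump or JConsole gives the same information with full stack traces.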
      

      We are running the Jenkins LTS version 1.596.1 and Durable Task Plugin 1.4.
      We also had this problem with Durable Task plugin 1.3.

      Running Durable Task plugin 1.2 on Jenkins LTS 1.580.3 seemed to work OK.

            Activity

            jglick Jesse Glick added a comment -

            Stephen Connolly these are your changes; any idea?

            stephenconnolly Stephen Connolly added a comment -

            Ultimately fixing this is part of https://github.com/jenkinsci/jenkins/pull/1596

            If you are a CloudBees customer, we have two hotfixes that seem to work around the deadlock, with the side effect of degrading UI performance.

            pablaasmo Per Arnold Blaasmo added a comment -

            Thanks for looking into this.
            Just for the record, my company is not currently a CloudBees customer.

            Some more info:
            To try to cope with the problem I downgraded to version 1.2 of the Durable Task Plugin.
            This seems to make things much more stable. We might still get a deadlock, but less often.

            stephenconnolly Stephen Connolly added a comment -

            You are awaiting this change in Jenkins core: https://github.com/stephenc/jenkins/blob/threadsafe-node-queue/core/src/main/java/hudson/slaves/ComputerRetentionWork.java

            You can work around it with a bit of Groovy script...

            Basically you need to create a subclass of ComputerRetentionWork where the doRun method wraps a call to its super.doRun, and then modify the extension list for PeriodicWork, removing the old ComputerRetentionWork instance and adding an instance of your subclass.

            stephenconnolly Stephen Connolly added a comment -

            For CloudBees customers, the hotfix you want is hotfix-zd-23541.

            pablaasmo Per Arnold Blaasmo added a comment -

            We still have deadlocks even though we downgraded the plugin. Seemingly not so often, but...
            The workaround you suggested seems a little "hairy" to me.

            I guess I need to wait for the fix in https://github.com/jenkinsci/jenkins/pull/1596 to be finished.

            stephenconnolly Stephen Connolly added a comment -

            Basically this is what you want the retention work that you are running to look like:

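            import hudson.model.Queue;
            import hudson.slaves.ComputerRetentionWork;
            import jenkins.model.Jenkins;

            public class SynchronizedComputerRetentionWork extends ComputerRetentionWork {

                @Override
                protected void doRun() {
                    // Take the Queue lock first, then the Jenkins monitor, before running the stock retention check.
                    Queue.withLock(new Runnable() {
                        @Override
                        public void run() {
                            synchronized (Jenkins.getInstance()) {
                                SynchronizedComputerRetentionWork.super.doRun();
                            }
                        }
                    });
                }

            }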
            

            and then you do something like

            Jenkins.getInstance().getExtensionList(PeriodicWork.class).remove(ComputerRetentionWork.class);
            Jenkins.getInstance().getExtensionList(PeriodicWork.class).add(new SynchronizedComputerRetentionWork());
            

            Now all the above is more Java than Groovy, so would need translating into Groovy. Then you just put it in your init.groovy and you are fine.

            As all this does is delegate back to the base method, it would be "safe" if you forgot to remove it after upgrading to something with PR#1596 merged, as the thread would simply acquire the two locks twice, and they are re-entrant locks.
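
The re-entrancy argument above can be checked with plain JDK primitives. This is only an illustrative sketch: `ReentrantDemo` and its methods are hypothetical names, and the lock objects are stand-ins rather than the actual Jenkins Queue lock or Jenkins instance. It shows that a thread which already owns a `ReentrantLock` or an intrinsic monitor can acquire it a second time without blocking.

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReentrantDemo {

    /** Acquires the same ReentrantLock twice on one thread and returns the hold count at that point. */
    static int nestedLockHoldCount() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            lock.lock(); // second acquisition by the owning thread succeeds immediately
            try {
                return lock.getHoldCount();
            } finally {
                lock.unlock();
            }
        } finally {
            lock.unlock();
        }
    }

    /** Enters the same intrinsic monitor twice; returns true if the inner block ran while holding it. */
    static boolean nestedSynchronized() {
        Object monitor = new Object();
        synchronized (monitor) {
            synchronized (monitor) { // re-entering a monitor the thread already owns does not block
                return Thread.holdsLock(monitor);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("hold count: " + nestedLockHoldCount());
        System.out.println("nested synchronized ran: " + nestedSynchronized());
    }
}
```

Each acquisition must still be paired with a release, which is why the sketch unlocks in nested `finally` blocks.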

            pablaasmo Per Arnold Blaasmo added a comment -

            @stephenconnolly, thanks. I will try to see if I can do this.
            I will report back about the result.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/Functions.java
            core/src/main/java/hudson/model/AbstractCIBase.java
            core/src/main/java/hudson/model/Computer.java
            core/src/main/java/hudson/model/Executor.java
            core/src/main/java/hudson/model/Hudson.java
            core/src/main/java/hudson/model/Node.java
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/ResourceController.java
            core/src/main/java/hudson/slaves/AbstractCloudSlave.java
            core/src/main/java/hudson/slaves/ComputerRetentionWork.java
            core/src/main/java/hudson/slaves/NodeProvisioner.java
            core/src/main/java/hudson/slaves/RetentionStrategy.java
            core/src/main/java/hudson/slaves/SlaveComputer.java
            core/src/main/java/jenkins/model/Jenkins.java
            core/src/main/java/jenkins/model/Nodes.java
            core/src/main/java/jenkins/util/AtmostOneTaskExecutor.java
            core/src/main/resources/hudson/model/Messages.properties
            core/src/main/resources/lib/hudson/executors.jelly
            core/src/main/resources/lib/layout/layout.jelly
            http://jenkins-ci.org/commit/jenkins/92147c3597308bc05e6448ccc41409fcc7c05fd7
            Log:
            [FIXED JENKINS-27565] Refactor the Queue and Nodes to use a consistent locking strategy

            The test system I set up to verify resolution of customer(s)' issues driving this change, required
            additional changes in order to fully resolve the issues at hand. As a result I am bundling these
            changes:

            • Moves nodes to being stored in separate config files outside of the main config file (improves performance) [FIXED JENKINS-27562]
            • Makes the Jenkins is loading screen not block on the extensions loading lock [FIXED JENKINS-27563]
            • Removes race condition rendering the list of executors [FIXED JENKINS-27564] [FIXED JENKINS-15355]
            • Tidy up the locks that were causing deadlocks with the once retention strategy in durable tasks [FIXED JENKINS-27476]
            • Remove any requirement from Jenkins Core to lock on the Queue when rendering the Jenkins UI [FIXED-JENKINS-27566]
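
The general idea behind a consistent locking strategy can be sketched with plain JDK locks. This is an assumption-laden illustration and not the code from the commit: `OrderedLocks`, `QUEUE_LOCK`, `NODES_MONITOR`, `withQueueLock`, and `updateNodes` are hypothetical names. Every code path takes the two locks in the same order (queue lock first, then the nodes monitor), so no two threads can each hold one lock while waiting for the other.

```java
import java.util.concurrent.locks.ReentrantLock;

public class OrderedLocks {
    // Hypothetical stand-ins for "the queue lock" and "the nodes lock"; not Jenkins classes.
    private static final ReentrantLock QUEUE_LOCK = new ReentrantLock();
    private static final Object NODES_MONITOR = new Object();

    /** Runs the task while holding the queue lock, releasing it even if the task throws. */
    static void withQueueLock(Runnable task) {
        QUEUE_LOCK.lock();
        try {
            task.run();
        } finally {
            QUEUE_LOCK.unlock();
        }
    }

    /** Every caller takes the queue lock first, then the nodes monitor. */
    static void updateNodes(Runnable change) {
        withQueueLock(() -> {
            synchronized (NODES_MONITOR) {
                change.run();
            }
        });
    }

    /** Runs many concurrent updates through the single ordered path and returns the final count. */
    static int runConcurrentUpdates(int threadCount, int updatesPerThread) {
        int[] counter = {0}; // only modified while both locks are held
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < updatesPerThread; n++) {
                    updateNodes(() -> counter[0]++);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println("updates applied: " + runConcurrentUpdates(4, 1000));
    }
}
```

Funnelling all mutations through one helper keeps the lock order in a single place, so a new caller cannot accidentally take the locks in the opposite order.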
            pablaasmo Per Arnold Blaasmo added a comment -

            As promised, I am reporting back on the result of using the workaround.

            I put this code in the 'init.groovy' file, based on the tips in this issue:

            import jenkins.model.Jenkins
            import java.util.logging.LogManager
            import hudson.model.PeriodicWork
            // Explicit import: Groovy otherwise resolves Queue to java.util.Queue
            import hudson.model.Queue
            import hudson.slaves.ComputerRetentionWork
            
            def logger = LogManager.getLogManager().getLogger("")
            
            /* JENKINS_HOME environment variable is not reliable */
            def jenkinsHome = Jenkins.instance.getRootDir().absolutePath
            logger.info("RUNNING init.groovy from ${jenkinsHome}")
            
            logger.info("--> workaround for deadlock in durable task plugin")
            
            public class SynchronizedComputerRetentionWork extends ComputerRetentionWork {
            
                @Override
                protected void doRun() {
                    Queue.withLock(new Runnable() {
                        @Override
                        public void run() {
                            synchronized (Jenkins.getInstance()) {
                                SynchronizedComputerRetentionWork.super.doRun();
                            }
                        }
                    });
                }
            
            }
            
            Jenkins.getInstance().getExtensionList(PeriodicWork.class).remove(ComputerRetentionWork.class);
            Jenkins.getInstance().getExtensionList(PeriodicWork.class).add(new SynchronizedComputerRetentionWork());
            
            

            The result is that I have not had any deadlocks in the last 24 hours.

            I also see that JENKINS-27565 is fixed, so I will await a new LTS version with that bugfix included.

            Thank you for your help Stephen!

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Stephen Connolly
            Path:
            changelog.html
            http://jenkins-ci.org/commit/jenkins/46dc6850edb1d7ef52592794b15e69db7dfbed1a
            Log:
            Noting merges JENKINS-15355 JENKINS-21618 JENKINS-22941 JENKINS-25938 JENKINS-26391 JENKINS-26900 JENKINS-27476 JENKINS-27563 JENKINS-27564 JENKINS-27565 JENKINS-27566
            Fixing link text for JENKINS-6167

            jglick Jesse Glick added a comment -

            This is filed in a plugin and so by definition cannot be lts-candidate. Stephen Connolly what is its status?

            danielbeck Daniel Beck added a comment -

            Per Arnold Blaasmo Does this issue still occur in Jenkins 1.607 or higher, or can it be considered resolved?

            jglick Jesse Glick added a comment -

            Closing as covered by the core fix unless I hear information to the contrary.

            pablaasmo Per Arnold Blaasmo added a comment -

            I have not seen this issue again, so I think it is OK to close.


              People

              • Assignee:
                stephenconnolly Stephen Connolly
              • Reporter:
                pablaasmo Per Arnold Blaasmo
              • Votes:
                2
              • Watchers:
                6
