Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-59793

Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail

    Details

    • Similar Issues:
    • Released As:
      jenkins-2.205

      Description

      In one large instance we can see 3000 threads with this shape:

       

      "QueueSubTaskMetrics [#11342]" #5508106 daemon prio=5 os_prio=0 tid=0x00007efcf085a800 nid=0x52c7 in Object.wait() [0x00007efbccb32000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:75) - locked <0x0000000512c01d10> (a hudson.model.queue.FutureImpl) at jenkins.metrics.impl.JenkinsMetricProviderImpl.lambda$asSupplier$3(JenkinsMetricProviderImpl.java:1142) at jenkins.metrics.impl.JenkinsMetricProviderImpl$$Lambda$388/1851215464.get(Unknown Source) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x00000004dc401fb0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

       

      There is way less number of jobs running or waiting (130) and the number of these threads never decreases.

      After investigating the code:

      • Metrics listens to the queue with a QueueListener creating threads with the name QueueSubTaskMetrics
      • The 3000 threads shown are all waiting for a future to be completed.
      while(!completed)
         wait();

      I think it could be avoided calling the get method with a timeout instead of getting blocked there: https://github.com/jenkinsci/metrics-plugin/commit/e803bc3b82b54bfe27e66a6393dedf53bdf1896e#diff-b02885f3ba6b4982b5322b73e664c0b6R1049

       

       

        Attachments

          Issue Links

            Activity

            mramonleon Ramon Leon created issue -
            Hide
            danielbeck Daniel Beck added a comment -

            Interesting, that might have been a source of too many threads on ci.jenkins.io recently as well.

            Show
            danielbeck Daniel Beck added a comment - Interesting, that might have been a source of too many threads on ci.jenkins.io recently as well.
            batmat Baptiste Mathus made changes -
            Field Original Value New Value
            Priority Major [ 3 ] Critical [ 2 ]
            Hide
            raihaan Raihaan Shouhell added a comment -

            Confirming that my instances get the same issue on Jenkins 2.176.2 with Metrics 4.0.2.4

            Show
            raihaan Raihaan Shouhell added a comment - Confirming that my instances get the same issue on Jenkins 2.176.2 with Metrics 4.0.2.4
            Hide
            mramonleon Ramon Leon added a comment -

            Another approach, I think the best candidate, but it requires fixing Jenkins core:

            I’m pondering whether we could solve the problem of so many QueueSubTaskMetrics threads waiting by ensuring the synchronizeEnd and synchronizeStart always set the future? For example by surrounding the code of these two methods with try-catch and setting the future in the finally clause before rethrowing the exception. I think this because if an exception is thrown in these methods, the result is not set, so the thread is always waiting because the complete flag is not set.

            WorkUnitContext.java#L132

            But this is a fix on Jenkins core, not in Metrics plugin.

            Show
            mramonleon Ramon Leon added a comment - Another approach, I think the best candidate, but it requires fixing Jenkins core: I’m pondering whether we could solve the problem of so many QueueSubTaskMetrics threads waiting by ensuring the synchronizeEnd and synchronizeStart always set the future? For example by surrounding the code of these two methods with try-catch and setting the future in the finally clause before rethrowing the exception. I think this because if an exception is thrown in these methods, the result is not set, so the thread is always waiting because the complete flag is not set. WorkUnitContext.java#L132 But this is a fix on Jenkins core, not in Metrics plugin.
            mramonleon Ramon Leon made changes -
            Assignee Ramon Leon [ mramonleon ]
            mramonleon Ramon Leon made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            Hide
            olivergondza Oliver Gondža added a comment -

            We are being impacted by this as well.

            Ramon Leon I believe the mistake is in the metrics plugin, that seems to presumes that when QueueListener#onLeft is called, its task will start eventually. On a first glance it appears this is not true when the task is canceled while in queue. There might be other reasons I am not aware of, too.

            Show
            olivergondza Oliver Gondža added a comment - We are being impacted by this as well. Ramon Leon I believe the mistake is in the metrics plugin, that seems to presumes that when QueueListener#onLeft is called, its task will start eventually. On a first glance it appears this is not true when the task is canceled while in queue. There might be other reasons I am not aware of, too.
            foundation_security_members CloudBees Foundation Security made changes -
            Comment [ I'm working right now on trying to reproduce these leaks playing with the cancellation of the build. Whatever clue or suggestion is welcome. :sweat: Thanks! ]
            Hide
            mramonleon Ramon Leon added a comment -

            I'm working right now on trying to reproduce these leaks playing with the cancellation of the build. Whatever clue or suggestion is welcome. :sweat: Thanks!

            Show
            mramonleon Ramon Leon added a comment - I'm working right now on trying to reproduce these leaks playing with the cancellation of the build. Whatever clue or suggestion is welcome. :sweat: Thanks!
            Hide
            mramonleon Ramon Leon added a comment -

            I’m trying to cancel jobs while in the Queue (cancel) or once running (getExecutor().interrupt()), but in this way, the job is correctly finished so the metrics thread is freed. I need to figure out a way to kill a job in a way that the future is not set in the WorkUnitContext: https://github.com/jenkinsci/jenkins/blob/264980c707c6cafc01369378b1e90d247561df78/core/src/main/java/hudson/model/queue/WorkUnitContext.java#L132

            Most likely playing with several executors and subtasks and interrupting some of them.

            Still working, any clue on how to achieve that is welcome Daniel Beck Jesse Glick Oleg Nenashev

            Show
            mramonleon Ramon Leon added a comment - I’m trying to cancel jobs while in the Queue ( cancel ) or once running ( getExecutor().interrupt() ), but in this way, the job is correctly finished so the metrics thread is freed. I need to figure out a way to kill a job in a way that the future is not set in the WorkUnitContext:  https://github.com/jenkinsci/jenkins/blob/264980c707c6cafc01369378b1e90d247561df78/core/src/main/java/hudson/model/queue/WorkUnitContext.java#L132 Most likely playing with several executors and subtasks and interrupting some of them. Still working, any clue on how to achieve that is welcome Daniel Beck Jesse Glick Oleg Nenashev
            Hide
            mramonleon Ramon Leon added a comment -

            I've managed to reproduce the problem in a test. I schedule a freestyle project with a long SleepBuilder and you add a SubTaskContributor throwing an exception in the run method of the Executable created. Then the WorkUnitContext#synchronizeEnd method exists with an InterruptedException and the future.set is not called. In addition, the next abort call also exists in if (aborted!=null) return; so the endLatch is not aborted properly. With this scenario, the wuc.future keeps waiting forever in the JenkinsMetricProviderImpl#asSupplier method.

            From my point of view the synchronizeEnd should guarantee the future is set even when throwing the InterruptedException, or the abort method shouldn't exit without aborting the latches.

            Show
            mramonleon Ramon Leon added a comment - I've managed to reproduce the problem in a test. I schedule a freestyle project with a long SleepBuilder and you add a SubTaskContributor throwing an exception in the run method of the Executable created. Then the WorkUnitContext#synchronizeEnd method exists with an InterruptedException and the future.set is not called. In addition, the next abort call also exists in if (aborted!=null) return; so the endLatch is not aborted properly. With this scenario, the wuc.future keeps waiting forever in the JenkinsMetricProviderImpl#asSupplier  method. From my point of view the synchronizeEnd should guarantee the future is set even when throwing the InterruptedException, or the abort method shouldn't exit without aborting the latches.
            Hide
            mramonleon Ramon Leon added a comment - - edited

            If the exception is thrown in a Builder the wrong behavior is not showed up:

            FreeStyleProject p = j.createProject(FreeStyleProject.class);
            
            p.getBuildersList().add(new TestBuilder() {
             @Override
             public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) throws InterruptedException {
             listener.getLogger().println("running failing builder");
             throw new IndexOutOfBoundsException("Unexpected exception while building");
             }
            });

            It happens because when it fails and the code enters the synchronizeEnd method, and then the endLatch.synchronize(), it exits without throwing an exception, which is different from when you get here from a SubTask#run

            Show
            mramonleon Ramon Leon added a comment - - edited If the exception is thrown in a Builder the wrong behavior is not showed up: FreeStyleProject p = j.createProject(FreeStyleProject.class); p.getBuildersList().add( new TestBuilder() { @Override public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) throws InterruptedException { listener.getLogger().println( "running failing builder" ); throw new IndexOutOfBoundsException( "Unexpected exception while building" ); } }); It happens because when it fails and the code enters the synchronizeEnd method, and then the endLatch.synchronize() , it exits without throwing an exception, which is different from when you get here from a SubTask#run
            Show
            mramonleon Ramon Leon added a comment - PR:   https://github.com/jenkinsci/jenkins/pull/4346
            mramonleon Ramon Leon made changes -
            Remote Link This issue links to "PR #4346 (Web Link)" [ 23945 ]
            mramonleon Ramon Leon made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            Hide
            mramonleon Ramon Leon added a comment -

            I've pushed the test failing, after the CI fails I will push the fix

            Show
            mramonleon Ramon Leon added a comment - I've pushed the test failing, after the CI fails I will push the fix
            Hide
            mramonleon Ramon Leon added a comment - - edited

            Besides the code in Metrics:

             

            https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L1142

             

            try {
                return future.get();
            } catch (Throwable t) {
                sneakyThrow(t);
                throw new RuntimeException(t);
            }
            

             

            We have here another affected code in Pipeline:

            https://github.com/jenkinsci/workflow-cps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L940

            try {
                return getCurrentExecutions(false).get(1, TimeUnit.SECONDS).stream().anyMatch(StepExecution::blocksRestart);
            } catch (Exception x) {
                // TODO RestartListener.Default.isReadyToRestart can throw checked exceptions, but AsynchronousExecution.blocksRestart does not currently allow it
                Level level = x.getCause() instanceof RejectedExecutionException ? /* stray Executor past program end? */ Level.FINE : Level.WARNING;
                LOGGER.log(level, "Not blocking restart due to problem checking running steps in " + this, x);
                return false;
            }
            

            We see in the log:

             

             

            2019-10-14 13:16:56.104+0000 [id=16386139]  WARNING o.j.p.w.cps.CpsFlowExecution#blocksRestart: Not blocking restart due to problem checking running steps in CpsFlowExecution[******]]java.util.concurrent.TimeoutException: Timeout waiting for task.
            at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:259)    
            at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)    
            at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.blocksRestart(CpsFlowExecution.java:940)    
            at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.blocksRestart(WorkflowRun.java:372)    
            at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.displayCell(WorkflowRun.java:375)    
            at hudson.model.Executor.isDisplayCell(Executor.java:668)
            ...

             

            Fortunately, this code just waits 1 second, so it doesn't block anything, although it populates the log with these messages.

            Show
            mramonleon Ramon Leon added a comment - - edited Besides the code in Metrics :   https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L1142   try { return future .get(); } catch (Throwable t) { sneakyThrow(t); throw new RuntimeException(t); }   We have here another affected code in Pipeline : https://github.com/jenkinsci/workflow-cps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L940 try { return getCurrentExecutions( false ).get(1, TimeUnit.SECONDS).stream().anyMatch(StepExecution::blocksRestart); } catch (Exception x) { // TODO RestartListener.Default.isReadyToRestart can throw checked exceptions, but AsynchronousExecution.blocksRestart does not currently allow it Level level = x.getCause() instanceof RejectedExecutionException ? /* stray Executor past program end? */ Level.FINE : Level.WARNING; LOGGER.log(level, "Not blocking restart due to problem checking running steps in " + this , x); return false ; } We see in the log:     2019-10-14 13:16:56.104+0000 [id=16386139]  WARNING o.j.p.w.cps.CpsFlowExecution#blocksRestart: Not blocking restart due to problem checking running steps in CpsFlowExecution[******]]java.util.concurrent.TimeoutException: Timeout waiting  for  task. at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:259)     at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)     at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.blocksRestart(CpsFlowExecution.java:940)     at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.blocksRestart(WorkflowRun.java:372)     at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.displayCell(WorkflowRun.java:375)     at hudson.model.Executor.isDisplayCell(Executor.java:668) ...   Fortunately, this code just waits 1 second, so it doesn't block anything, although it populates the log with these messages.
            batmat Baptiste Mathus made changes -
            Labels plugin lts-candidate plugin
            batmat Baptiste Mathus made changes -
            Labels lts-candidate plugin core lts-candidate plugin
            batmat Baptiste Mathus made changes -
            Component/s core [ 15593 ]
            Component/s metrics-plugin [ 18751 ]
            batmat Baptiste Mathus made changes -
            Labels core lts-candidate plugin core lts-candidate metrics
            Hide
            jglick Jesse Glick added a comment -

            I do not believe the workflow-cps code mentioned has any relation to this issue. Perhaps I am missing something.

            Show
            jglick Jesse Glick added a comment - I do not believe the workflow-cps code mentioned has any relation to this issue. Perhaps I am missing something.
            Hide
            mramonleon Ramon Leon added a comment -

            I mentioned this because in the instance we detected the metric threads, there were a lot of log messages of this type. But maybe it's not related and it's just a performance issue.

            Show
            mramonleon Ramon Leon added a comment - I mentioned this because in the instance we detected the metric threads, there were a lot of log messages of this type. But maybe it's not related and it's just a performance issue.
            mramonleon Ramon Leon made changes -
            Summary Possible thread leak 'QueueSubTaskMetrics' in metrics Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail
            Hide
            mramonleon Ramon Leon added a comment -

            Added what has been done to the title for the sake of searches

            Show
            mramonleon Ramon Leon added a comment - Added what has been done to the title for the sake of searches
            mramonleon Ramon Leon made changes -
            Status In Review [ 10005 ] Fixed but Unreleased [ 10203 ]
            Resolution Fixed [ 1 ]
            danielbeck Daniel Beck made changes -
            Status Fixed but Unreleased [ 10203 ] Resolved [ 5 ]
            Released As jenkins-2.205
            Hide
            olivergondza Oliver Gondža added a comment -

            Ramon Leon, thanks for the prompt fix of the issue. I would appreciate if you can report back whether this fully mediated the problem in affected production deployment.

            Show
            olivergondza Oliver Gondža added a comment - Ramon Leon , thanks for the prompt fix of the issue. I would appreciate if you can report back whether this fully mediated the problem in affected production deployment.
            olivergondza Oliver Gondža made changes -
            Labels core lts-candidate metrics 2.204.1-fixed core metrics
            Hide
            mramonleon Ramon Leon added a comment -

            It will take some time to land, I can give you some feedback by the end of December

            Show
            mramonleon Ramon Leon added a comment - It will take some time to land, I can give you some feedback by the end of December
            Hide
            raihaan Raihaan Shouhell added a comment -

            We seem to see this issue with metrics. Will this be backported to the LTS release? Oliver Gondža

            We are currently on 2.176.2

            Show
            raihaan Raihaan Shouhell added a comment - We seem to see this issue with metrics. Will this be backported to the LTS release? Oliver Gondža We are currently on 2.176.2
            Hide
            olivergondza Oliver Gondža added a comment -

            This was backported to 2.204.1. RC is expected to be out on Dec 4th, final release on Dec 18th.

            Show
            olivergondza Oliver Gondža added a comment - This was backported to 2.204.1. RC is expected to be out on Dec 4th, final release on Dec 18th.

              People

              • Assignee:
                mramonleon Ramon Leon
                Reporter:
                mramonleon Ramon Leon
              • Votes:
                1 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: