Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-58899

ThreadLocal leak causing high CPU usage with DurableTaskStep threads

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Minor
    • Resolution: Fixed
    • Component/s: workflow-api-plugin
    • Labels:
      None
    • Environment:
      Jenkins 2.187
      Linux
      openjdk version "1.8.0_222"
    • Similar Issues:
    • Released As:
      workflow-api 2.37

      Description

      Our large Jenkins instance had a CPU usage spike last weekend.

      After some digging, we found that all `org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep` threads tended to get stuck in `java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry` with various stacks, which seems to be the consequence of a ThreadLocal leak:

       

      "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#12]: checking <redacted> on i-0db24f1a6b300ea1f" #3845 daemon prio=5 os_prio=0 tid=0x00007f2dd8bd2800 nid=0x30bb runnable [0x00007f2d2a926000]
         java.lang.Thread.State: RUNNABLE
              at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617)
              at java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499)
              at java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298)
              at java.lang.ThreadLocal.remove(ThreadLocal.java:222)
              at hudson.remoting.Channel.setCurrent(Channel.java:1787)
              at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:328)
              at hudson.remoting.Channel.call(Channel.java:957)
              at hudson.FilePath.act(FilePath.java:1072)
              at hudson.FilePath.act(FilePath.java:1061)
              at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:209)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:532)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:489)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)

       

       

      Digged the content of the ThreadLocalMap of each of the live threads of the `ScheduledThreadPoolExecutor` and got something similar:

       

      Thread 3841 org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#8]: checking <redacted> on i-04a6c09bbdc596bbd
      
      ThreadLocalMap size: 65536
      
      Elements count per key type:
      [org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$FlushControlledOutputStream$1:20704, java.lang.ThreadLocal:47, hudson.remoting.Channel$ThreadLastIoId:34, null:42550, org.codehaus.groovy.runtime.NullObject:2200, java.lang.ThreadLocal$SuppliedThreadLocal:1]

       

      It looks like the following ThreadLocal is never cleared since it gets attached to reused threads from a thread pool: 

      https://github.com/jenkinsci/workflow-api-plugin/blob/907cf64feb2f38e93aebbedf87946b0235c3dc93/src/main/java/org/jenkinsci/plugins/workflow/log/DelayBufferedOutputStream.java#L123

        Attachments

          Issue Links

            Activity

            Hide
            dnusbaum Devin Nusbaum added a comment -

            I think we should call ThreadLocal.remove here instead of trying to reset the original value, since calls to DelayBufferedOutputStream.flushBuffer will always be from a thread in the jenkins.util.Timer thread pool, and so even if we did care about what the original value was (I don't think we do, the default is true, we set it to false just for the call to flush, and then set it back to true, and it is a ThreadLocal so we don't have to worry about interference from other threads, so it will always be true going into flushBuffer IIUC), it would only matter if we happened to use a thread from the pool that we had already used previously, which is not something we can count on.

            Show
            dnusbaum Devin Nusbaum added a comment - I think we should call ThreadLocal.remove here instead of trying to reset the original value, since calls to DelayBufferedOutputStream.flushBuffer will always be from a thread in the jenkins.util.Timer thread pool, and so even if we did care about what the original value was (I don't think we do, the default is true , we set it to false just for the call to flush , and then set it back to true , and it is a ThreadLocal so we don't have to worry about interference from other threads, so it will always be true going into flushBuffer IIUC), it would only matter if we happened to use a thread from the pool that we had already used previously, which is not something we can count on.
            Hide
            glavoie Gabriel Lavoie added a comment -

            As additional data to the comment above, I forgot to mention that all leaked values were boolean "true". 

            Show
            glavoie Gabriel Lavoie added a comment - As additional data to the comment above, I forgot to mention that all leaked values were boolean "true". 
            Hide
            dnusbaum Devin Nusbaum added a comment -

            Gabriel Lavoie If you want to review https://github.com/jenkinsci/workflow-api-plugin/pull/100, and/or test it out on your instance to confirm whether it fixes the issue (Especially to see if we should modify FlushControlledOutputStream.flush in the same way based on real data), that would be appreciated, thanks!

            Show
            dnusbaum Devin Nusbaum added a comment - Gabriel Lavoie If you want to review https://github.com/jenkinsci/workflow-api-plugin/pull/100 , and/or test it out on your instance to confirm whether it fixes the issue (Especially to see if we should modify FlushControlledOutputStream.flush in the same way based on real data), that would be appreciated, thanks!
            Hide
            glavoie Gabriel Lavoie added a comment -

            Will get back to you by the end of the week.

            Show
            glavoie Gabriel Lavoie added a comment - Will get back to you by the end of the week.
            Hide
            dnusbaum Devin Nusbaum added a comment -

            A fix for this issue was just released in version 2.37 of Pipeline API Plugin. Gabriel Lavoie thank you for the detailed bug report and for helping test the fix!

            Show
            dnusbaum Devin Nusbaum added a comment - A fix for this issue was just released in version 2.37 of Pipeline API Plugin. Gabriel Lavoie thank you for the detailed bug report and for helping test the fix!

              People

              • Assignee:
                dnusbaum Devin Nusbaum
                Reporter:
                glavoie Gabriel Lavoie
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: