Jenkins / JENKINS-27650

Page loads slow with hundreds of throttled builds in queue

      Description

      When there are hundreds of throttled builds in the queue, page load times increase by an order of magnitude.

      Steps to reproduce:

      1. Run Jenkins 1.580.2 and the latest throttle-concurrent-builds plugin
      2. Create a matrix job with 200 combinations (attached)
      3. In the same job, select "Throttle Concurrent Builds" with a maximum of 7 concurrent builds, as part of a category called 'semaphore'
      4. Set the number of executors on the 'master' node to 200
      5. Run the job. Only 7 builds should be running, due to the throttling

      Page load times increase by an order of magnitude; I observed 10 seconds for

      time curl http://localhost:8080/jenkins/ajaxBuildQueue

      If you remove the throttling in the job configuration, page load times are under 50 ms.
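      For reference, the same measurement can be made with a small standalone Java check instead of curl. This is only an illustrative sketch and assumes the instance is reachable at http://localhost:8080/jenkins with anonymous read access.

      import java.io.InputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;

      // Illustrative timing check, roughly equivalent to `time curl`.
      // Assumes anonymous read access to http://localhost:8080/jenkins.
      public class TimeBuildQueue {
          public static void main(String[] args) throws Exception {
              URL url = new URL("http://localhost:8080/jenkins/ajaxBuildQueue");
              long start = System.nanoTime();
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              int status = conn.getResponseCode();           // send the request
              try (InputStream in = conn.getInputStream()) { // drain the response body
                  byte[] buf = new byte[8192];
                  while (in.read(buf) != -1) { /* discard */ }
              }
              long elapsedMs = (System.nanoTime() - start) / 1_000_000;
              System.out.println("HTTP " + status + " in " + elapsedMs + " ms");
          }
      }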

      Activity

            oleg_nenashev Oleg Nenashev added a comment -

            Hi Ryan,

            This issue is caused by the behavior of the Queue cache in Jenkins core.
            See the discussion in JENKINS-20046.

            It's possible to optimize the performance of the plugin somewhat (e.g. by internal caching), but in general I would recommend improving the behavior of Queue caching for the web interfaces.
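            As a purely illustrative sketch of what such internal caching inside the plugin could look like (class, field, and helper names such as countRunningBuilds() are hypothetical, not the plugin's actual code), assuming the expensive per-category count can be reused for a short interval:

            import hudson.Extension;
            import hudson.model.Node;
            import hudson.model.Queue;
            import hudson.model.queue.CauseOfBlockage;
            import hudson.model.queue.QueueTaskDispatcher;
            import java.util.concurrent.ConcurrentHashMap;

            // Sketch only: reuse the per-category running count for a short interval so
            // that repeated canTake() calls during one Queue.maintain() pass do not
            // re-walk every node and executor.
            @Extension
            public class CachingThrottleDispatcher extends QueueTaskDispatcher {

                private static final long TTL_MS = 500; // hypothetical cache lifetime

                private static class Cached {
                    final int count;
                    final long timestamp;
                    Cached(int count, long timestamp) { this.count = count; this.timestamp = timestamp; }
                }

                private final ConcurrentHashMap<String, Cached> countsByCategory = new ConcurrentHashMap<String, Cached>();

                @Override
                public CauseOfBlockage canTake(Node node, Queue.BuildableItem item) {
                    final String category = "semaphore"; // hypothetical: read from the job's throttle configuration
                    final int limit = 7;                 // hypothetical: read from the category configuration
                    long now = System.currentTimeMillis();
                    Cached cached = countsByCategory.get(category);
                    if (cached == null || now - cached.timestamp > TTL_MS) {
                        cached = new Cached(countRunningBuilds(category), now); // the expensive walk, done rarely
                        countsByCategory.put(category, cached);
                    }
                    if (cached.count >= limit) {
                        return new CauseOfBlockage() {
                            @Override
                            public String getShortDescription() {
                                return "Throttled: category '" + category + "' already has " + limit + " running builds";
                            }
                        };
                    }
                    return null; // not blocked
                }

                private int countRunningBuilds(String category) {
                    // placeholder for the plugin's existing expensive computation over all nodes/executors
                    return 0;
                }
            }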

            danielbeck Daniel Beck added a comment -

            Oleg: It would be interesting to see the effect of Stephen's change, starting with 1.607.

            http://jenkins-ci.org/changelog
            https://github.com/jenkinsci/jenkins/pull/1596

            oleg_nenashev Oleg Nenashev added a comment -

            Yes, we had a discussion with Ryan on this topic in Skype.

            My opinion:

            • Changes from Stephen Connolly will definitely improve performance
            • Queue::CachedItemList still needs improvement to provide better caching
            • I would move the periodic cache refresh to an independent thread with a configurable priority (high by default)
            recampbell Ryan Campbell added a comment -

            The thread dump seen when reproducing the issue.

            recampbell Ryan Campbell added a comment -

            Is it possible to do this calculation without holding a lock on the Queue the entire time?

            For instance, could it be done asynchronously in some executor pool? As it stands, the entire process keeps the queue locked because it is called from Queue.maintain(). Until the result is ready, the plugin could return a Blockage reason of "Calculating..." But this is just speculation – I'm casting about for a solution limited to the plugin.

            I realize this can be fixed in core, but for many people on older LTS, this plugin is basically broken for high-end usage.
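            As a rough, non-authoritative sketch of the asynchronous idea above (helper names such as recomputeCounts() are hypothetical, and this glosses over the staleness questions that come with answering from a possibly outdated snapshot):

            import hudson.Extension;
            import hudson.model.Node;
            import hudson.model.Queue;
            import hudson.model.queue.CauseOfBlockage;
            import hudson.model.queue.QueueTaskDispatcher;
            import java.util.Collections;
            import java.util.Map;
            import java.util.concurrent.Callable;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            // Sketch only: push the expensive per-category counting to a background
            // executor and answer "Calculating..." until a result is available, so the
            // Queue lock is never held for the whole computation.
            @Extension
            public class AsyncThrottleDispatcher extends QueueTaskDispatcher {

                private final ExecutorService pool = Executors.newSingleThreadExecutor();
                private volatile Future<Map<String, Integer>> pending;
                private volatile Map<String, Integer> lastCounts; // last completed snapshot

                @Override
                public CauseOfBlockage canTake(Node node, Queue.BuildableItem item) {
                    final String category = "semaphore"; // hypothetical: read from the job's throttle configuration
                    final int limit = 7;                 // hypothetical: read from the category configuration

                    if (pending != null && pending.isDone()) {
                        try {
                            lastCounts = pending.get(); // pick up the finished snapshot
                        } catch (Exception ignored) {
                            // keep the previous snapshot on failure
                        }
                        pending = null;
                    }
                    if (pending == null) {
                        Callable<Map<String, Integer>> refresh = this::recomputeCounts;
                        pending = pool.submit(refresh); // schedule the next refresh off the Queue lock
                    }

                    Map<String, Integer> counts = lastCounts;
                    if (counts == null) {
                        return new CauseOfBlockage() { // no result yet: keep the item queued with a placeholder reason
                            @Override
                            public String getShortDescription() { return "Calculating throttle state..."; }
                        };
                    }
                    Integer running = counts.get(category);
                    if (running != null && running >= limit) {
                        return new CauseOfBlockage() {
                            @Override
                            public String getShortDescription() { return "Throttled: category '" + category + "'"; }
                        };
                    }
                    return null; // not blocked
                }

                private Map<String, Integer> recomputeCounts() {
                    // placeholder for the existing expensive walk over all nodes and executors
                    return Collections.emptyMap();
                }
            }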

            jglick Jesse Glick added a comment -

            Will see if https://github.com/jenkinsci/throttle-concurrent-builds-plugin/pull/26 can be improved upon. From the thread dumps it is clear that this plugin is to blame for wasting resources. Yes, 1.607+ will no longer block the web UI due to problems like this, but the backend will still be excessively loaded.

            recampbell Ryan Campbell added a comment -

            The referenced job configuration.

            jglick Jesse Glick added a comment -

            Turns out those steps to reproduce do not work after all.

            oleg_nenashev Oleg Nenashev added a comment -

            @Jesse
            The issue appears when there is heavy job traffic through the queue (real submissions instead of canTake() calls). I would recommend the following:

            • decrease the execution time of the test jobs (e.g. to 1 second) and increase the build queue to about 10k items
            • reconfigure the throttling policies to support many submissions at once
            jglick Jesse Glick added a comment -

            To reproduce: clone the attached repo, and inside it

            docker build -t jenkins-27650 .
            docker run -p 8080:8080 jenkins-27650
            

            and then from http://localhost:8080/ click the Build button next to runme. Jenkins will quickly become unresponsive.

            jglick Jesse Glick added a comment -

            Experimenting, no luck so far.

            jglick Jesse Glick added a comment -

            JENKINS-19623 apparently was not enough.

            jglick Jesse Glick added a comment -

            Tried various things in PR 27, but as explained there, the result is not satisfactory. I suspect JENKINS-27708 needs to be addressed first.

            My fear is that the current basic design of ThrottleQueueTaskDispatcher just cannot be made to scale well. I wonder if it would be better to invert the logic: implement ExecutorListener (as a second extension) to track what is running in each category, keeping a map from nodes to a histogram of task counts running by category (WeakHashMap<Node,HashMap<String,Integer>>?). Then canTake/canRun would only need to look up configuration for the proposed job, and do a table lookup to see the current count and compare that to the configured limit.

            I am not sure how that would relate to JENKINS-27708. ExecutorListener seems to be called with the Queue lock held, which is good, but that problem seems to stem from QueueTaskDispatcher being asked to make decisions about multiple jobs before any of them are actually scheduled. The call to taskAccepted does come from new WorkUnitContext, within maintain, so the question is whether this is interleaved with QueueTaskDispatcher calls, or after all of them have completed.
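            To make the proposed bookkeeping concrete, here is a purely illustrative sketch of that inverted approach (as the next comment points out, ExecutorListener was not an extension point at the time, so this could not simply be registered; categoryOf() is a hypothetical lookup of the task's throttle category):

            import hudson.model.Executor;
            import hudson.model.Node;
            import hudson.model.Queue;
            import hudson.slaves.ExecutorListener;
            import java.util.HashMap;
            import java.util.Map;
            import java.util.WeakHashMap;

            // Sketch only: keep a live histogram of running task counts per category per
            // node, so canTake()/canRun() become simple table lookups against the limit.
            public class CategoryCountTracker implements ExecutorListener {

                // node -> (category -> number of running builds in that category)
                private final Map<Node, Map<String, Integer>> running = new WeakHashMap<Node, Map<String, Integer>>();

                @Override
                public synchronized void taskAccepted(Executor executor, Queue.Task task) {
                    adjust(executor, task, +1);
                }

                @Override
                public synchronized void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
                    adjust(executor, task, -1);
                }

                @Override
                public synchronized void taskCompletedWithProblems(Executor executor, Queue.Task task,
                                                                   long durationMS, Throwable problems) {
                    adjust(executor, task, -1);
                }

                /** Table lookup that a canTake()/canRun() implementation could compare against the configured limit. */
                public synchronized int runningInCategory(Node node, String category) {
                    Map<String, Integer> byCategory = running.get(node);
                    Integer count = byCategory == null ? null : byCategory.get(category);
                    return count == null ? 0 : count;
                }

                private void adjust(Executor executor, Queue.Task task, int delta) {
                    Node node = executor.getOwner().getNode();
                    if (node == null) {
                        return;
                    }
                    String category = categoryOf(task); // hypothetical: read from the task's throttle configuration
                    if (category == null) {
                        return;
                    }
                    Map<String, Integer> byCategory = running.computeIfAbsent(node, n -> new HashMap<String, Integer>());
                    int next = byCategory.getOrDefault(category, 0) + delta;
                    byCategory.put(category, Math.max(next, 0));
                }

                private String categoryOf(Queue.Task task) {
                    // placeholder: look up the ThrottleJobProperty / category for this task
                    return null;
                }
            }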

            oleg_nenashev Oleg Nenashev added a comment -

            Update to the previous comment:

            • ExecutorListener is not an extension point, so we cannot make this approach work
            • There are no listeners in Jenkins core that could reliably deliver the info

            I've tried to introduce light-weight off-the-queue caching in PR #28. The result was not satisfactory either. The performance of canTake() improves by up to 10 times in my local benchmarks, but it is still not enough to resolve the issue.

            We could somehow merge PRs #27 and #28, but I'm afraid the solution would remain unreliable. Additional synchronisation would be required in that case, so scheduling behaviour would be impacted due to the injected quietTimes.

            Hacking the load balancer could help, but it would conflict with other plugins.

            oleg_nenashev Oleg Nenashev added a comment -

            Not in progress anymore.

            Some of the improvements have been integrated into the plugin, but IMHO it's not enough.


              People

              • Assignee: Unassigned
              • Reporter: recampbell Ryan Campbell
              • Votes: 12
              • Watchers: 20
