Jenkins / JENKINS-45553

Parallel pipeline execution scales poorly


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Environment:

      Description

      Execution of parallel blocks scales poorly once the number of branches N exceeds about 100. With ~50 nodes (each with 4 executors, for a total of ~200 slots), the following pipeline job takes extraordinarily long to execute:

       

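      // Build a map of branch name -> closure, one entry per sub-job.
      def stepsForParallel = [:]
      int subJobs = Integer.valueOf(params.SUB_JOBS)  // parse the parameter once
      for (int i = 0; i < subJobs; i++) {
        def s = "subjob_${i}"  // per-iteration variable, safe to capture in the closure
        stepsForParallel[s] = {
          node("darwin") {
            echo "hello"
          }
        }
      }
      // Run all branches concurrently.
      parallel stepsForParallel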
      

       

      SUB_JOBS   Time (sec)
      ---------------------
       100         10
       200         40
       300         96
       400        214
       500        392
       600        660
       700        960
       800       1500
       900       2220
      1000       gave up...
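
      The timings above can be checked for their growth rate. The following is a minimal sketch in plain Groovy (run outside Jenkins), assuming the run time follows a power law t ≈ c · N^k. The `exponent` helper is a name introduced here for illustration, not part of the original report.

```groovy
// Timings copied from the table above: SUB_JOBS -> seconds.
def timings = [100: 10, 200: 40, 300: 96, 400: 214, 500: 392,
               600: 660, 700: 960, 800: 1500, 900: 2220]

// Assuming t ~ c * N^k, any two rows give k = ln(t2 / t1) / ln(n2 / n1).
def exponent = { int n1, int n2 ->
  double timeRatio = timings[n2] / timings[n1]
  double sizeRatio = n2 / n1
  Math.log(timeRatio) / Math.log(sizeRatio)
}

println "100 -> 200: k = ${exponent(100, 200)}"
println "100 -> 900: k = ${exponent(100, 900)}"
println "500 -> 900: k = ${exponent(500, 900)}"
```

      On these numbers the estimate is 2 for the 100 → 200 step and climbs toward 3 at the high end, so the slowdown is at least quadratic in the number of branches. Nine data points from a single system are not enough to settle the exact form.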

      At no point does the underlying system become taxed: CPU utilization stays very low, as this is a very beefy system (28 cores, 128GB RAM, SSDs).

      CPU and Thread CPU Time Sampling (via VisualVM) are attached for reference.

       

       

       

       

       

            Activity

            florian_meser Florian Meser added a comment -

            Hello Sam Van Oort, as you mentioned above, I tested the new versions and there is definitely an improvement. I updated shortly after you wrote that comment and I'm still using those versions. We rely heavily on this feature, since our whole test infrastructure depends on deploying data to nodes for many branches, so our Jenkins is running 24/7 (with up to 1-2k executors in the queue).

            Nevertheless, the scaling cannot be considered stable. We have many tests that need ~2 min to run and wait ~10-15 min (worst case) to be processed by Jenkins. As mentioned in https://issues.jenkins-ci.org/browse/JENKINS-45876, there seems to be a quadratic or exponential correlation. That means that even with a big improvement, it hits its limits once this threshold is crossed.

            In my opinion there is still room for further improvement, so that large Jenkins environments also become more efficient.

            svanoort Sam Van Oort added a comment -

            Florian Meser I agree completely that there is some room for further optimization of massively-parallel pipeline execution – the best place to currently follow the work and investigations is https://issues.jenkins-ci.org/browse/JENKINS-47724 now.  That ticket also includes some concrete advice that may help with your scenario. 

            If you'd like to add some quantitative scaling observations to help identify where the bottleneck is, that might be of some assistance – I also expect the work currently in beta release from JENKINS-47170 will help a bit (reduces the per-flownode overheads associated with pipelines quite significantly – that's a small component of parallel execution).

            Very likely you'll see a big improvement from the next phase of that work, https://issues.jenkins-ci.org/browse/JENKINS-38381, which was the culprit here for a lot of the nonlinear behaviors – that's slated to be my next strategic push on performance, along with some tactical fixes that may help with your scenario.

            svanoort Sam Van Oort added a comment -

            One other comment: the bottleneck appears to occur only with massive parallels in a single pipeline. If you break your job into smaller ones with fewer parallel branches in each, the per-branch overhead will matter less.
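
            As a rough illustration of that suggestion, and not a tested configuration, a parent pipeline could fan out to a handful of child builds, each of which runs a smaller `parallel` block. The job name `parallel-chunk`, its parameters `FIRST_INDEX` and `CHUNK_SIZE`, and the chunk size of 100 are all assumptions made for this sketch. It also assumes the `build` step from the Pipeline: Build Step plugin is installed.

```groovy
// Parent pipeline: a few child builds instead of one very wide parallel block.
// 'parallel-chunk' is a hypothetical downstream job that runs CHUNK_SIZE
// branches of its own, starting at index FIRST_INDEX.
int total = Integer.valueOf(params.SUB_JOBS)
int chunkSize = 100  // assumed value; the report shows 100 branches finishing in ~10 s

def chunks = [:]
for (int start = 0; start < total; start += chunkSize) {
  int first = start                                 // per-iteration copy for the closure
  int size = Math.min(chunkSize, total - start)     // last chunk may be smaller
  chunks["chunk_${first}"] = {
    build job: 'parallel-chunk', parameters: [
      string(name: 'FIRST_INDEX', value: String.valueOf(first)),
      string(name: 'CHUNK_SIZE', value: String.valueOf(size))
    ]
  }
}
parallel chunks
```

            With SUB_JOBS = 900 this gives the parent 9 branches rather than 900. Whether the total run time improves depends on how the child builds themselves are scheduled.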

            Pipeline is also never going to achieve fully linear scale-out with large numbers of executors, because only some parts of the execution can take full advantage of parallel execution – primarily the shell/batch/powershell steps that should be doing the bulk of work.  Our work is primarily focused on reducing the other overheads so it can spend more time executing those steps. 

            Amdahl's Law in spades, basically.
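
            For reference, Amdahl's Law bounds the speedup at 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of executors. The plain Groovy sketch below evaluates it for a few hypothetical values of p. These fractions are illustrative only and were not measured on any Jenkins installation.

```groovy
// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
// p = fraction of the work that can run in parallel, n = number of executors.
def speedup = { double p, int n -> 1.0d / ((1.0d - p) + p / n) }

// Hypothetical parallel fractions, chosen only to show the shape of the curve.
[0.50d, 0.90d, 0.99d].each { p ->
  println "p = ${p}: 200 executors -> ${speedup(p, 200)}x, upper limit -> ${1.0d / (1.0d - p)}x"
}
```

            Even with 90% of the work parallelizable, 200 executors give a speedup below 10x, which is why reducing the serial overhead matters more than adding executors.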

            florian_meser Florian Meser added a comment - edited

            Sam Van Oort I'm currently trying to implement some time measurements to get quantitative scaling observations. I don't have much time to spend on that at the moment, though. As soon as I have something I'll let you know.
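
            One way such a measurement could look, as a minimal sketch rather than a tested setup: each branch records when it starts and when it obtains an executor, then logs the difference. This reuses the `darwin` label and the `SUB_JOBS` parameter from the issue description, and it assumes `System.currentTimeMillis()` is permitted by the script sandbox on the installation in question.

```groovy
def branches = [:]
int total = Integer.valueOf(params.SUB_JOBS)
for (int i = 0; i < total; i++) {
  def name = "subjob_${i}"
  branches[name] = {
    long started = System.currentTimeMillis()     // branch body begins
    node("darwin") {
      long acquired = System.currentTimeMillis()  // executor has been allocated
      echo "${name}: waited ${acquired - started} ms for an executor"
      // ... the real work of the branch would go here ...
    }
  }
}

long begin = System.currentTimeMillis()
parallel branches
echo "all ${total} branches finished after ${System.currentTimeMillis() - begin} ms"
```

            Collecting the per-branch wait times for several values of SUB_JOBS would show whether the delay grows with the branch count or with queue length.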

            I don't know if this is off-topic, but it seems another bottleneck has just appeared. Hence the question: are there any observations regarding the Meltdown/Spectre updates for Windows 7, which again seem to dramatically reduce the performance of our so-called "massive parallels in a single pipeline"?

            I'm observing a dramatic loss of performance although no related changes were made to our Jenkins pipeline. KB4056894 definitely contained Meltdown/Spectre fixes. I'm quite curious whether I'm the only one having this kind of trouble.

            svanoort Sam Van Oort added a comment -

            Florian Meser  I'm not sure what the performance impact of the Meltdown/Spectre updates is on Windows - not really set up for scaling tests on Windows, but it might be related to changes in IO performance. 

            Please try out the advice I just added in the latest comment on  https://issues.jenkins-ci.org/browse/JENKINS-47724 – this should help considerably.  The last few months have been heavily focused on performance improvements to Pipeline and it should show in a big way.


              People

              • Assignee: Jesse Glick (jglick)
              • Reporter: Tom Skrainar (tskrainar)
              • Votes: 4
              • Watchers: 13
