JENKINS-47681

Pipeline with parallel jobs hang with EC2 plugin

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment:
      Jenkins 2.60.3 & 2.86
      EC2 Plugin 1.36
      Pipeline Plugin 2.5
      Pipeline API Plugin 2.23.1

      Summary:

      We are seeing major issues running scripted pipelines with parallel jobs using the EC2 plugin. I haven't seen these issues when running freestyle matrix jobs, but they occur very frequently with the parallel pipeline jobs that replaced our freestyle matrix functionality. I'm able to reproduce them using the sample code included below. The issues look to be EC2 plugin related, but since we have only seen them with parallel pipeline jobs, the Pipeline plugin could also be involved.

      Below are some of the issues we are having.

      • "Build Now" button clicked but pipeline build never started. (Sometimes shows as "pending - ???")
      • Jenkins slaves are slow to start or don't startup at all. (Just a spinning icon on the slave log page)
      • Jobs sit in the queue even though an available idle slave is available (node label matches)
      • Starting two pipeline builds hang and neither of them finish
      • We constantly see "Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx)" throughout the jenkins.log file
      • Thread dumps during our investigation shows blocking at hudson.plugins.ec2.AmazonEC2Cloud

      Details:

      Below are some details we found in our investigation.

      Calculating countCurrentEC2Slaves takes a long time with a large AWS account. Perhaps availableTotalSlaves should be hard-coded when instanceCap is Integer.MAX_VALUE and the template is null, since counting the existing slaves cannot change the provisioning decision in that case.
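
      A minimal sketch of that short-circuit, assuming a simplified shape of the capacity check (the class, method, and parameter names here are illustrative, not the plugin's exact API):

      import java.util.function.IntSupplier;

      // Hypothetical sketch only: skip the expensive instance enumeration
      // when the cloud-wide cap is effectively unlimited.
      class CapacityCheck {
        static final int UNLIMITED = Integer.MAX_VALUE;

        static int availableTotalSlaves(int instanceCap, IntSupplier countCurrentEC2Slaves) {
          if (instanceCap == UNLIMITED) {
            // The count cannot affect the decision, so avoid the AWS calls.
            return UNLIMITED;
          }
          return Math.max(0, instanceCap - countCurrentEC2Slaves.getAsInt());
        }
      }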

      I haven't seen possibleSlavesCount go below 0, so it doesn't appear to stop provisioning slaves. Perhaps we should stop provisioning slaves when possibleSlavesCount is <= 0 and return only the stopped-slaves count in the possibleSlavesCount variable. I think adding "&& stateName != InstanceStateName.Stopped" to the countCurrentEC2Slaves calculation would help; a sketch follows the log line below. The goal should be to start only the stopped slaves if cap is available in the template and then provision new instances if necessary. I'm seeing the line below throughout the log file even when the instance is already running a job.

      Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers.
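
      A minimal sketch of that filter, assuming a simplified version of the counting loop (Instance and InstanceStateName are the real AWS SDK types; the surrounding loop shape is assumed):

      import com.amazonaws.services.ec2.model.Instance;
      import com.amazonaws.services.ec2.model.InstanceStateName;
      import java.util.List;

      // Hypothetical sketch only: stopped instances no longer count toward
      // the running total, so the plugin can restart them instead of
      // treating them as already-provisioned capacity.
      class SlaveCounter {
        static int countCurrentEC2Slaves(List<Instance> instances) {
          int count = 0;
          for (Instance instance : instances) {
            InstanceStateName stateName =
                InstanceStateName.fromValue(instance.getState().getName());
            if (stateName != InstanceStateName.Terminated
                && stateName != InstanceStateName.ShuttingDown
                && stateName != InstanceStateName.Stopped) { // proposed addition
              count++;
            }
          }
          return count;
        }
      }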

      When the " while (excessWorkload > 0) " loop runs, I see same instance being added multiple times. Shouldn't this loop over different instances? Perhaps the "while" statement in the code should be converted to an "if" statement so it only runs once?

      Oct 20, 2017 9:21:05 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      Oct 20, 2017 9:21:14 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      Oct 20, 2017 9:21:26 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      Oct 20, 2017 9:21:36 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      Oct 20, 2017 9:21:46 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      Oct 20, 2017 9:21:56 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Added node named: JENKINS-SLAVE-NODE-NAME (i-xxxxxxxxxxxxxxxx), We have now 20 computers
      ...
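
      A minimal sketch of the while-to-if idea, using an assumed, highly simplified shape of EC2Cloud.provision() (the real method does much more):

      // Hypothetical sketch only: provision at most one slave per
      // NodeProvisioner cycle, so the same stopped instance cannot be
      // matched and re-added on every iteration of the old "while" loop.
      class ProvisionOnce {
        static int provision(int excessWorkload, int numExecutors) {
          if (excessWorkload > 0) { // was: while (excessWorkload > 0)
            // template.provision(...) would add a single node here
            excessWorkload -= numExecutors;
          }
          return excessWorkload;
        }
      }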

      We don't use spot instances, so the EC2 alive-slaves monitor doesn't seem necessary for us. I have seen it take thousands of seconds to complete, and it runs every 10 minutes by default. My Jenkins masters have around 200 slaves on them. I'd prefer this monitoring service not run while builds are actively running. Should looping through the nodes for the check be reserved for when no executors are busy, i.e., when the total busy executor count across Jenkins.getInstance().getComputers() is 0? (A sketch of that guard follows the log excerpt below.) We have increased jenkins.ec2.checkAlivePeriod, which seems to help.

      Oct 25, 2017 4:33:55 AM hudson.model.AsyncPeriodicWork doRun
      INFO: EC2 alive slaves monitor thread is still running. Execution aborted.
      Oct 25, 2017 4:43:55 AM hudson.model.AsyncPeriodicWork doRun
      INFO: EC2 alive slaves monitor thread is still running. Execution aborted.
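
      A minimal sketch of that idle guard (Computer.countBusy() is a real Jenkins API; the class name and the monitor wiring around it are assumed):

      import hudson.model.Computer;
      import jenkins.model.Jenkins;

      // Hypothetical sketch only: let the alive-slaves monitor bail out
      // early whenever any executor is busy, deferring the per-node checks
      // to an idle period.
      class AliveMonitorGuard {
        static boolean jenkinsIsIdle() {
          for (Computer c : Jenkins.getInstance().getComputers()) {
            if (c.countBusy() > 0) {
              return false; // builds are running; skip this cycle
            }
          }
          return true;
        }
      }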

      How to reproduce:

      I was able to reproduce the issues using the pipeline code below.

      def stepsForParallel = [:]
      for (int i = 0; i < 300; i++) {
        def s = "subjob_${i}"
        stepsForParallel[s] = {
          node("JENKINS-SLAVE-NODE-LABEL") {
            sh '''
            date +%c
            '''
          }
        }
      }
      timestamps {
        parallel stepsForParallel
      }
      

      Workarounds:

      It looks like EC2Cloud synchronizes in many places, so we increased both hudson.slaves.NodeProvisioner.recurrencePeriod and jenkins.ec2.checkAlivePeriod to reduce the number of concurrent AWS requests. We also customized a fork of the EC2 plugin with many of the items from the details section above. This appears to have reduced some of the blocking; however, we are still testing. We also skip countCurrentEC2Slaves when the previous run returned 0 available slaves and occurred recently, to further reduce the number of AWS calls.
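
      A minimal sketch of that skip logic, assuming a small cache wrapped around the count (the class name and the cooldown value are illustrative):

      // Hypothetical sketch only: remember when the last count reported
      // zero available capacity and skip the AWS calls for a short
      // cooldown window afterwards.
      class CountCache {
        private static final long COOLDOWN_MS = 60_000L; // illustrative value
        private long lastZeroResultAt = 0L;

        synchronized boolean shouldSkipCount() {
          return System.currentTimeMillis() - lastZeroResultAt < COOLDOWN_MS;
        }

        synchronized void recordAvailableSlaves(int availableSlaves) {
          if (availableSlaves == 0) {
            lastZeroResultAt = System.currentTimeMillis();
          }
        }
      }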

      Attachments:

        1. thread_dump1.txt
          9 kB
        2. thread_dump2.txt
          14 kB
        3. thread_dump3.txt
          9 kB

            Assignee: francisu Francis Upton
            Reporter: mkozell Mike Kozell
            Votes: 2
            Watchers: 8
