  1. Jenkins
  2. JENKINS-56036

Spot Instance Plugin Spawns Arbitrary Number of Instances

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: ec2-plugin
    • Labels:
      None
      Description

      Feb 07, 2019 4:55:53 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest
      Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"],"InstanceId.1":["i-0cda2343e023df94c"]}Headers: (User-Agent: aws-sdk-java/1.11.457 Linux/4.4.0-47-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.111-b14 java/1.8.0_111 groovy/2.4.12, amz-sdk-invocation-id: 1eb7a6a4-e994-6b97-17bf-e412668194d8, )

      The EC2 Spot instance functionality keeps spawning the spot instances while ignoring the existing ones that are running.

      I see stacktraces like this in Jenkins when this happens.

      INFO: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      Feb 07, 2019 4:56:38 PM hudson.plugins.ec2.EC2Cloud provision
      WARNING: SlaveTemplate{ami='ami-005b3a8001dab02a9', labels=''}. Exception during provisioning
      com.amazonaws.AmazonClientException: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:54)
      at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
      at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:499)
      at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:159)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:44)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:37)
      at hudson.plugins.ec2.SlaveTemplate.newSpotSlave(SlaveTemplate.java:979)
      at hudson.plugins.ec2.SlaveTemplate.provisionSpot(SlaveTemplate.java:919)
      at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:464)
      at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:578)
      at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:594)
      at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
      at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
      at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:62)
      at hudson.slaves.NodeProvisioner$1.run(NodeProvisioner.java:177)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      I believe this is a result of calling a method on a partially constructed object.

      1) An EC2SpotSlave is constructed with a non-null spot instance request id (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/SlaveTemplate.java#L1060)

      2) That constructor first calls the super constructor of EC2AbstractSlave. The instance variable spotInstanceRequestId is only assigned after the super constructor returns (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L51)

      3) The EC2AbstractSlave constructor calls fetchLiveInstanceData (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L164)

      4) fetchLiveInstanceData ends up calling getInstanceId() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L503)

      5) This is overridden in EC2SpotSlave, which calls getSpotRequest() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L165)

      6) This calls describeSpotInstanceRequests using spotInstanceRequestId which is null (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L129)

      7) If you pass in null it returns ALL THE SPOT REQUESTS IN THE REGION IN ANY RANDOM ORDER AMAZON WANTS TO GIVE BACK

      8) We take the first spot request and use that which might not necessarily be the spot request we are interested in. Maybe this plugin appears to work a lot of the time because EC2 mostly gives the latest instance back first. Hmmm..

      9) Random shit starts happening because you are using corrupt data
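
      The initialization-order problem described in steps 1 to 6 can be reproduced in isolation. The sketch below is not the plugin's code: AbstractNode, SpotNode, describe() and the id "sir-example" are hypothetical stand-ins for EC2AbstractSlave, EC2SpotSlave, getInstanceId() and a real spot request id. It only shows that an overridden method invoked from a superclass constructor sees the subclass field as null.

```java
// Hypothetical stand-ins; none of these names exist in the ec2-plugin.
final class InitOrderDemo {
    public static void main(String[] args) {
        SpotNode node = new SpotNode("sir-example");
        // Value captured during the super constructor, before the field was assigned.
        System.out.println(node.resolvedId);
        // Same method called after construction has finished.
        System.out.println(node.describe());
    }
}

abstract class AbstractNode {
    final String resolvedId;

    AbstractNode() {
        // Calls an overridable method before the subclass constructor body has run.
        this.resolvedId = describe();
    }

    abstract String describe();
}

class SpotNode extends AbstractNode {
    private final String spotRequestId;

    SpotNode(String spotRequestId) {
        super(); // describe() runs in here while spotRequestId is still null
        this.spotRequestId = spotRequestId;
    }

    @Override
    String describe() {
        return "lookup(" + spotRequestId + ")";
    }
}
```

      In this sketch the first value is built from a null id and the second from the real one, which matches the behaviour suspected above. One common way around it is to pass the id up to the super constructor, or to defer the lookup until construction has finished; whether either fits the plugin is an assumption.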

      So, considering that this code can spawn lots of instances when things become corrupt, and that I was seeing errors like 'Unexpected number of reservations', I wonder if we can be a bit more defensive and set a flag that prevents further instances from being spawned when things are corrupt.
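
      As a rough illustration of what "more defensive" could mean, here is a minimal sketch that refuses to look anything up when the request id is null and refuses to pick "the first result" unless exactly one request matches. SpotLookupGuard, SpotRequest, findExactlyOne and the sample ids are all hypothetical; this does not use the AWS SDK and is not how the plugin is written.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper; not part of the ec2-plugin or the AWS SDK.
final class SpotLookupGuard {

    /** Hypothetical view of a spot request: its id and the instance it launched. */
    static final class SpotRequest {
        final String requestId;
        final String instanceId;

        SpotRequest(String requestId, String instanceId) {
            this.requestId = requestId;
            this.instanceId = instanceId;
        }
    }

    /** Returns the single request with the given id, or fails loudly. */
    static SpotRequest findExactlyOne(String requestId, List<SpotRequest> candidates) {
        // Fail fast instead of issuing an unfiltered query with a null id.
        Objects.requireNonNull(requestId, "spot request id is not set yet");
        List<SpotRequest> matches = candidates.stream()
                .filter(r -> requestId.equals(r.requestId))
                .collect(Collectors.toList());
        if (matches.size() != 1) {
            throw new IllegalStateException(
                    "expected 1 spot request for " + requestId + ", found " + matches.size());
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        List<SpotRequest> region = Arrays.asList(
                new SpotRequest("sir-aaa", "i-111"),
                new SpotRequest("sir-bbb", "i-222"));
        System.out.println(findExactlyOne("sir-bbb", region).instanceId);
        try {
            findExactlyOne(null, region);
        } catch (NullPointerException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

      A caller could treat either exception as the signal to stop provisioning for that template rather than retrying, which is roughly the flag suggested above.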

          Activity

          benmmurphy Ben Murphy added a comment -

          This actually might not be so bad. I think when it fails it rechecks the instance limit by looking at all the instances tagged with jenkins so it won't actually try and spawn more than the instance limit. This is a kind of cloudy interpretation of the code so don't take my word on it. :/

          benmmurphy Ben Murphy added a comment -

          This seems to have been broken in this commit: dd7bdefc4a214934facb93306c33bcda1c9a3a9a

          Oh.. did i forget to mention i hate OO and random side effects :/

          thoulen FABRIZIO MANFREDI added a comment -

          Thanks, I will provide a fix in the 1.43 (I hope)

          akennealy Andy Kennealy added a comment -

          I really need a fix for the deadlock issue, which is apparently fixed in the latest version, but I had to roll back to 1.39 because of this issue, JENKINS-55720, and JENKINS-55639.

          I need to restart Jenkins multiple times a day because of the deadlock issue. I think I'm going to try reverting back to 1.36

          thaipham Thai Pham added a comment -

          FABRIZIO MANFREDI do you have any ETA on when 1.43 will be released?

          laszlog Laszlo Gaal added a comment -

          Saw the same thing happening on Jenkins v2.150.2 and EC2 plugin version 1.42.

          Ben Murphy, my experience seems to confirm yours: on a single request for a worker, the plugin spun up as many spot instances as the instance limit allowed.


            People

            • Assignee:
              thoulen FABRIZIO MANFREDI
              Reporter:
              benmmurphy Ben Murphy
            • Votes:
              7
              Watchers:
              9

              Dates

              • Created:
                Updated: