JENKINS-56036: Spot Instance Plugin Spawns Arbitrary Number of Instances


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • ec2-plugin
    • None

      Feb 07, 2019 4:55:53 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest
      Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"],"InstanceId.1":["i-0cda2343e023df94c"]}Headers: (User-Agent: aws-sdk-java/1.11.457 Linux/4.4.0-47-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.111-b14 java/1.8.0_111 groovy/2.4.12, amz-sdk-invocation-id: 1eb7a6a4-e994-6b97-17bf-e412668194d8, )

      The EC2 Spot instance functionality keeps spawning new spot instances while ignoring the ones that are already running.

      I see stack traces like this in the Jenkins log when this happens:

      INFO: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      Feb 07, 2019 4:56:38 PM hudson.plugins.ec2.EC2Cloud provision
      WARNING: SlaveTemplate{ami='ami-005b3a8001dab02a9', labels=''}. Exception during provisioning
      com.amazonaws.AmazonClientException: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:54)
      at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
      at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:499)
      at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:159)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:44)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:37)
      at hudson.plugins.ec2.SlaveTemplate.newSpotSlave(SlaveTemplate.java:979)
      at hudson.plugins.ec2.SlaveTemplate.provisionSpot(SlaveTemplate.java:919)
      at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:464)
      at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:578)
      at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:594)
      at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
      at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
      at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:62)
      at hudson.slaves.NodeProvisioner$1.run(NodeProvisioner.java:177)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      I believe this is the result of calling a method on a partially constructed object.

      1) An EC2SpotSlave is constructed with a non-null spot instance request id (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/SlaveTemplate.java#L1060)

      2) The EC2SpotSlave constructor calls the super constructor of EC2AbstractSlave. The instance variable spotInstanceRequestId is assigned only after the super constructor returns (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L51)

      3) The EC2AbstractSlave constructor calls fetchLiveInstanceData (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L164)

      4) fetchLiveInstanceData ends up calling getInstanceId() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L503)

      5) getInstanceId() is overridden in EC2SpotSlave, where it calls getSpotRequest() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L165)

      6) getSpotRequest() calls describeSpotInstanceRequests using spotInstanceRequestId, which is still null at this point (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L129)

      7) If you pass in null, it returns ALL the spot requests in the region, in whatever order Amazon chooses to give them back

      8) We take the first spot request and use it, even though it is not necessarily the spot request we are interested in. Maybe this plugin appears to work a lot of the time because EC2 mostly gives the latest instance back first. Hmmm..

      9) Unpredictable things start happening because the slave is now working with data from the wrong spot request
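
The sequence in steps 1) to 8) can be reproduced in isolation. The following is a minimal sketch, not the plugin's code: AbstractSlaveSketch, SpotSlaveSketch, describeFirst and the in-memory request list are hypothetical stand-ins for EC2AbstractSlave, EC2SpotSlave and the EC2 describe call, and it assumes (as step 7 describes) that a null filter matches every spot request.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for EC2AbstractSlave (illustrative names only).
abstract class AbstractSlaveSketch {
    final String resolvedInstanceId;

    AbstractSlaveSketch() {
        // Like step 3: the super constructor calls an overridable method.
        this.resolvedInstanceId = getInstanceId();
    }

    abstract String getInstanceId();
}

// Hypothetical stand-in for EC2SpotSlave.
class SpotSlaveSketch extends AbstractSlaveSketch {
    // Pretend region-wide list of spot requests: {requestId, instanceId}.
    static final List<String[]> ALL_REQUESTS = Arrays.asList(
            new String[] {"sir-other", "i-other"},
            new String[] {"sir-mine", "i-mine"});

    private final String spotInstanceRequestId;

    SpotSlaveSketch(String spotInstanceRequestId) {
        super(); // getInstanceId() runs in here, while the field below is still null
        this.spotInstanceRequestId = spotInstanceRequestId;
    }

    // Assumption: a null filter matches every request, and the first match wins.
    static String[] describeFirst(String requestIdFilter) {
        for (String[] request : ALL_REQUESTS) {
            if (requestIdFilter == null || requestIdFilter.equals(request[0])) {
                return request;
            }
        }
        return null;
    }

    @Override
    String getInstanceId() {
        String[] request = describeFirst(spotInstanceRequestId);
        return request == null ? null : request[1];
    }
}

class PartialConstructionDemo {
    public static void main(String[] args) {
        SpotSlaveSketch slave = new SpotSlaveSketch("sir-mine");
        // Value captured during construction, with the null filter.
        System.out.println(slave.resolvedInstanceId);
        // Value once the object is fully constructed.
        System.out.println(slave.getInstanceId());
    }
}
```

In this sketch the id captured inside the super constructor belongs to an unrelated request, while the same call made after construction returns the intended one. Passing the request id up to the super constructor, or deferring the lookup until construction has finished, would avoid the mismatch here; whether that fits the real class hierarchy is untested.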

      So.. this code can spawn lots of instances when its data becomes corrupt, and I was seeing errors like 'Unexpected number of reservations' while that happened. I wonder if we can be a bit more defensive and set a flag that prevents further instances from being spawned once corruption is detected.
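
One possible shape for such a flag is sketched below. This is only an illustration of the idea: ProvisioningGuard and all of its methods are hypothetical names, nothing here exists in the ec2-plugin, and where the checks would be wired into the provisioning path is left open.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard, not part of the ec2-plugin API.
class ProvisioningGuard {
    // Holds the reason provisioning was halted, or null while healthy.
    private final AtomicReference<String> haltReason = new AtomicReference<>(null);

    // Call when instance data looks inconsistent, e.g. zero reservations
    // reported for an instance id the plugin believes it just created.
    void reportCorruption(String reason) {
        haltReason.compareAndSet(null, reason); // keep the first reason seen
    }

    // Check before requesting any new instance.
    boolean mayProvision() {
        return haltReason.get() == null;
    }

    String haltReason() {
        return haltReason.get();
    }

    // Intended for an administrator to re-enable provisioning after investigating.
    void reset() {
        haltReason.set(null);
    }

    // Refuses a null filter or a response for a different request and trips the guard.
    String requireMatchingRequest(String expectedRequestId, String returnedRequestId) {
        if (expectedRequestId == null || !expectedRequestId.equals(returnedRequestId)) {
            reportCorruption("spot request mismatch: expected " + expectedRequestId
                    + ", got " + returnedRequestId);
            throw new IllegalStateException(haltReason());
        }
        return returnedRequestId;
    }
}
```

The guard fails closed: once a mismatch is reported, mayProvision() stays false until someone calls reset(), which trades availability for not launching further instances based on data that is known to be wrong.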

            thoulen FABRIZIO MANFREDI
            benmmurphy Ben Murphy
            Votes: 8
            Watchers: 11