Jenkins / JENKINS-27193

EC2 plugin still misdetects not-yet-started slaves, produces zombie slaves

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • ec2-plugin
    • None

      Background: "Zombie slaves" (build slaves started by the EC2 plugin but never used for builds and never shut down properly) have been a problem with the EC2 plugin since we started using it. With older plugin versions, such slaves were visible in Jenkins, marked as "offline". We upgraded to newer plugin versions (from 1.18 to 1.24, then 1.25/1.26) and rejoiced, as the zombie slave issue appeared to be gone. Then we checked our AWS web console and saw a bunch of zombie slaves. So the only change is that they're no longer visible in Jenkins, which just makes them harder to notice.

      So I went to investigate what's going on now. Here's a typical exception for a zombie slave in the Jenkins log:

      Mar 02, 2015 6:36:57 AM hudson.slaves.NodeProvisioner update
      WARNING: Provisioned slave Kernel cloud (ami-cb2509a2) failed to launch
      com.amazonaws.AmazonServiceException: The instance ID 'i-ebe58f04' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidInstanceID.NotFound; Request ID: 2d4058ea-e5ca-44c8-aeb5-08d4fe637d41)
              at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:886)
              at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:484)
              at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:256)
              at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:8798)
              at com.amazonaws.services.ec2.AmazonEC2Client.createTags(AmazonEC2Client.java:4990)
              at hudson.plugins.ec2.SlaveTemplate.updateRemoteTags(SlaveTemplate.java:732)
              at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:426)
              at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:287)
              at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:398)
              at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:394)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      

      Looking at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:426) (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/SlaveTemplate.java#L426), there's already a loop that retries the operation with a sleep between attempts. But it checks for the "InvalidInstanceRequestID" error code. Note that there was already a commit, https://github.com/jenkinsci/ec2-plugin/commit/9a596f0577b29a3e1835143f5d51520babdd7c1f , to correct a "typo in error code". However, according to http://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html , there's no "InvalidInstanceRequestID" error code (as currently in the code), but there is "InvalidInstanceID" (as in the exception above).

      So the proper fix appears to be to replace the error code at https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/SlaveTemplate.java#L429 with "InvalidInstanceID.NotFound".
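
To illustrate the proposed change, here is a minimal, self-contained sketch of a retry loop keyed on the "InvalidInstanceID.NotFound" error code. It is not the plugin's actual code: the class TagRetrySketch, the method retryWhileNotFound, and the ServiceException type (a stand-in for com.amazonaws.AmazonServiceException so the example runs without the AWS SDK) are all hypothetical names, and the attempt count and sleep interval are arbitrary.

```java
class TagRetrySketch {
    /** Hypothetical stand-in for com.amazonaws.AmazonServiceException. */
    static class ServiceException extends RuntimeException {
        private final String errorCode;

        ServiceException(String errorCode, String message) {
            super(message);
            this.errorCode = errorCode;
        }

        String getErrorCode() {
            return errorCode;
        }
    }

    /** Error code taken from the exception in the log above. */
    static final String NOT_FOUND = "InvalidInstanceID.NotFound";

    /**
     * Runs op, retrying while the service reports that the instance does not
     * exist yet. Any other error code, or running out of attempts, rethrows.
     * Returns the number of attempts that were needed.
     */
    static int retryWhileNotFound(Runnable op, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();
                return attempt;
            } catch (ServiceException e) {
                if (!NOT_FOUND.equals(e.getErrorCode()) || attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(sleepMillis); // give EC2 time to register the instance
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated tagging call: fails twice with NotFound, then succeeds.
        int attempts = retryWhileNotFound(() -> {
            if (++calls[0] < 3) {
                throw new ServiceException(NOT_FOUND, "The instance ID 'i-ebe58f04' does not exist");
            }
        }, 5, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

With a check like this, an error code that never matches (such as "InvalidInstanceRequestID") falls through to the rethrow on the first failure, which matches the behavior seen in the log: the tagging call fails once and provisioning is abandoned while the instance keeps running.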

            francisu Francis Upton
            pfalcon Paul Sokolovsky