JENKINS-34408

EC2 plugin repeatedly tries to provision an unresponsive slave


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Component/s: ec2-plugin
    • Labels: None
    • Environment: ec2-plugin 1.31
      Jenkins 1.642.2

      Occasionally one of our stopped slaves will not restart, because ec2-plugin is not able to connect to it over SSH. The plugin aborts after the launch timeout, enumerates existing slaves, and selects the exact same unresponsive one to provision, even if there are many other stopped slaves available.

      In the system log, I see ec2-plugin repeatedly enumerate all stopped slaves matching the AMI (~20 available) and select the same one each time: "Using existing slave: i-2a021dbe". In the log for that slave, I can see it wait to connect over SSH, abort at the configured launch timeout of 180s, and then start attempting to connect again.

      Ideally, I would like ec2-plugin to delete any node that fails to launch. When I manually delete the node, the others begin to start up as expected. Marking the node temporarily offline would also be OK, if it doesn't trigger JENKINS-33945. A lesser mitigation would be to select an existing slave at random, instead of deterministically.
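
      The mitigations suggested above could be sketched roughly as follows. This is only an illustration and not ec2-plugin code: the class StoppedInstancePicker and its methods are hypothetical names, and it assumes the plugin could keep a set of instance IDs whose last launch timed out. It skips those instances and picks among the remaining stopped ones at random.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper (not part of ec2-plugin): chooses a stopped instance to restart. */
public class StoppedInstancePicker {
    // Assumed bookkeeping: IDs of instances whose last launch hit the launch timeout.
    private final Set<String> failedLaunches = new HashSet<>();
    private final Random random;

    public StoppedInstancePicker(Random random) {
        this.random = random;
    }

    /** Record that launching this instance timed out, so it is skipped from now on. */
    public void markLaunchFailed(String instanceId) {
        failedLaunches.add(instanceId);
    }

    /** Pick uniformly at random among stopped instances that have not failed to launch. */
    public Optional<String> pick(List<String> stoppedInstanceIds) {
        List<String> candidates = new ArrayList<>();
        for (String id : stoppedInstanceIds) {
            if (!failedLaunches.contains(id)) {
                candidates.add(id);
            }
        }
        if (candidates.isEmpty()) {
            // Nothing usable is left; the caller would need to provision a new instance.
            return Optional.empty();
        }
        return Optional.of(candidates.get(random.nextInt(candidates.size())));
    }
}
```

      With something like this, calling markLaunchFailed("i-2a021dbe") after the 180s timeout would keep that instance from being selected again, and the random choice alone would already stop a single bad instance from blocking all the others.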

      Marking as critical because this can completely prevent any stopped nodes from coming back up.

            Assignee: francisu Francis Upton
            Reporter: mihelich Patrick Mihelich
            Votes: 1
            Watchers: 4
