Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Component/s: ec2-plugin
Labels:
None
Environment:
ec2-plugin 1.31
Jenkins 1.642.2

Occasionally one of our stopped slaves will not restart, because ec2-plugin is not able to connect to it over SSH. The plugin aborts after the launch timeout, enumerates existing slaves, and selects the exact same unresponsive one to provision, even if there are many other stopped slaves available.

In the system log, I see ec2-plugin repeatedly enumerate all stopped slaves matching the AMI (~20 available) and select the same one: "Using existing slave: i-2a021dbe". In the log for that slave, I can see it wait to connect over SSH, abort at the configured launch timeout of 180s, then start attempting to connect again.

Ideally, I would like ec2-plugin to delete any node that fails to launch. When I manually delete the node, the others begin to start up as expected. Marking the node temporarily offline would also be OK, if it doesn't trigger ~~JENKINS-33945~~. A lesser mitigation would be to select an existing slave at random, instead of deterministically.
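To make the last two suggestions concrete, here is a minimal sketch of the selection logic I have in mind. It is not ec2-plugin code: the function name `pick_stopped_slave`, the `failed_ids` set, and every instance ID other than i-2a021dbe are hypothetical, and it assumes the plugin could remember which instances failed to launch. It skips those instances and picks at random among the rest.

```python
import random
from typing import Iterable, Optional, Set


def pick_stopped_slave(
    stopped_ids: Iterable[str],
    failed_ids: Set[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick a stopped instance to restart.

    Hypothetical helper, not part of ec2-plugin. Instances listed in
    failed_ids (ones that already hit the launch timeout) are skipped,
    and the choice among the remaining ones is random rather than
    deterministic, so one bad instance cannot block all the others.
    """
    rng = rng or random.Random()
    candidates = [i for i in stopped_ids if i not in failed_ids]
    if not candidates:
        # Nothing usable is left; the caller would provision a new instance.
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    # i-2a021dbe is the stuck instance from the log; the other IDs are made up.
    stopped = ["i-2a021dbe", "i-00000001", "i-00000002"]
    failed = {"i-2a021dbe"}
    print(pick_stopped_slave(stopped, failed))
```

Even without tracking failures (an empty `failed_ids`), the random choice alone would make it unlikely for one unresponsive instance to be retried indefinitely while ~20 others sit idle.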

Marking as critical because this can completely prevent any stopped nodes from coming back up.

Assignee: Francis Upton

Reporter: Patrick Mihelich

Votes: 1

Watchers: 4

Created: 2016-04-23 02:20

Updated: 2016-05-25 18:20

Resolved: 2016-05-25 18:20
