Jenkins / JENKINS-23792

PATCH: EC2 plugin idles-down nodes that are still launching

      Description

      The ec2-plugin will treat a still-launching node as idle (because it has no jobs running) and will thus shut it down if its idle timeout expires.

      This means that if the idle timeout for a node is shorter than the time it takes to do one-off configuration on first launch, it will be shut down before it finishes starting up, aborting any init script that might be in progress.

      That's a clear bug - a node isn't idle when it's launching, and shouldn't be idled out until it's first made a connection with its slave agent.
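
      To make the failure mode concrete, here is a minimal standalone Java model of the check as described (the Node class, its field names, and the timeout constant are illustrative only, not the plugin's actual types):

      import java.util.concurrent.TimeUnit;

      // A toy model of the idle check described above.
      public class IdleCheckSketch {
          static final long IDLE_TIMEOUT_MILLIS = TimeUnit.MINUTES.toMillis(5);

          static class Node {
              long launchRequestedAt;  // when the launch was requested
              boolean agentConnected;  // true once the agent connects and executors exist
              long lastJobFinishedAt;  // meaningful only after the agent has connected

              // Mirrors the reported behaviour: with no executors yet, the node
              // counts as "idle since launch", even while its init script runs.
              long idleSinceMillis() {
                  return agentConnected ? lastJobFinishedAt : launchRequestedAt;
              }
          }

          // The buggy check: the launching state is never consulted.
          static boolean shouldStop(Node n, long now) {
              return now - n.idleSinceMillis() > IDLE_TIMEOUT_MILLIS;
          }

          public static void main(String[] args) {
              Node n = new Node();
              n.launchRequestedAt = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(6);
              n.agentConnected = false; // init script still running after 6 minutes
              // Prints true: the node is stopped before it ever finishes launching.
              System.out.println(shouldStop(n, System.currentTimeMillis()));
          }
      }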

            Activity

            ringerc Craig Ringer added a comment -

            This also causes a race condition in demand-launched workers, where:

            • ec2-plugin launches a new node when one is requested by the executor
            • the node takes longer than the idle timeout to start
            • ec2-plugin stops the node
            • the executor waits indefinitely for the node to become available

            Jenkins doesn't try to start the node again, because as far as it's concerned the node is still starting up. ec2-plugin doesn't start it because as far as it's concerned it started, went idle, and was stopped.

            ringerc Craig Ringer added a comment -

            As a workaround for this, you can set the idle time longer than the longest period a node might reasonably take to run its init script.

            If you're using stop/start instead of terminate for your nodes, it's probably because they take a while to initialize, so this limits how eagerly you can shut them down. Still, it works.

            ringerc Craig Ringer added a comment -

            Actually, that turns out to be insufficient. I just hit the case again with an already-built instance that takes significantly less than the idle timer to boot from stopped.

            It looks like it's an issue if the build time takes longer than the timeout and there's a

            Output for node launch looks like:

            Starting existing instance: i-205dcf62 result:{StartingInstances: [{InstanceId: i-205dcf62,CurrentState: {Code: 0,Name: pending},PreviousState: {Code: 80,Name: stopped}}]}
            Waiting for SSH to come up. Sleeping 5.
            Connecting to 10.0.1.206 on port 22, with timeout 10000.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to 10.0.1.206 on port 22, with timeout 10000.
            

            (repeating endlessly), but when the node state is examined in AWS it's shown as stopped.

            It's not clear from the ec2 plugin logs why this is happening; they look fairly reasonable:

            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2RetentionStrategy _check
            Idle timeout: Amazon Linux 2012.09 EBS 64-bit  (i-205dcf62)
            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave idleTimeout
            EC2 instance idle time expired: i-205dcf62
            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave stop
            EC2 instance stopped: i-205dcf62
            

            Those are logs for hudson.plugins.ec2 at log level FINEST.

            I'm still speculating that there's a race - or two - here. It's not clear if this is the same one as above, or if this is another issue with a race between requesting a new instance and idle timeout.
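
            For reference, the connect loop implied by the log above looks roughly like this (a simplified, hypothetical Java model; the real loop lives in the plugin's launcher code). Nothing in the loop re-checks the instance's EC2 state, so a stop issued mid-launch leaves it retrying forever:

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            public class SshWaitSketch {
                // Hypothetical model of "Waiting for SSH to come up. Sleeping 5."
                static void waitForSsh(String host) throws InterruptedException {
                    while (true) {
                        try (Socket s = new Socket()) {
                            s.connect(new InetSocketAddress(host, 22), 10_000);
                            return; // SSH is up; the launch can proceed
                        } catch (IOException e) {
                            System.out.println("Waiting for SSH to come up. Sleeping 5.");
                            Thread.sleep(5_000); // the instance may already be stopped; we never notice
                        }
                    }
                }
            }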

            ringerc Craig Ringer added a comment -

            I've put together a change to increase the logging detail on this:

            https://github.com/jenkinsci/ec2-plugin/pull/101

            While reading the code for this, I also noticed that hudson.model.Computer's getIdleStartMilliseconds() returns Long.MIN_VALUE if there are no executors.

            If there are executors, then the greatest of the last executed job time and the executor's owning computer's getConnectTime() is returned. The executor's last executed job time is initialized to the executor's creation time.

            The Computer's connectTime is initialized to zero, but is set to the current time when proactive launch begins.

            So unless a node is actively running a job on an executor, it's reported as having been idle since creation, even if it's busily running init scripts and other setup so the slave agent hasn't connected and created executors yet.
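
            In code form, the behaviour described above is roughly this (a simplified paraphrase of hudson.model.Computer#getIdleStartMilliseconds, not the exact core source):

            import hudson.model.Computer;
            import hudson.model.Executor;

            class IdleStartSketch {
                static long idleStart(Computer c) {
                    long firstIdle = Long.MIN_VALUE; // no executors yet => idle "since forever"
                    for (Executor e : c.getExecutors()) {
                        // Per executor: the later of its last job's finish time (initialized
                        // to the executor's creation time) and the owner's connect time.
                        firstIdle = Math.max(firstIdle, e.getIdleStartMilliseconds());
                    }
                    return firstIdle;
                }
            }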

            That's why it's being stopped again before it's started properly.

            I think the best fix for this will be for EC2Computer to @Override getIdleStartMilliseconds(...) and report a time in the future (i.e. busy) while the node is still starting and hasn't yet connected its slave agent.

            Or at least until the init script finishes running, in EC2UnixLauncher.launch(...).

            It looks like testing isOffline() should be sufficient for that - if the node is offline, it's clearly not a candidate for idle shutdown.

            To be really picky, we could instead have the ComputerLauncher set a flag on the Computer when init scripts have finished and the slave agent has been launched. However, that's the point at which a channel is registered and isOffline becomes false, so there seems little point.

            ringerc Craig Ringer added a comment -

            Actually, it's better to suppress idle timeout in the retention manager, rather than playing with the reported idle time.

            See https://github.com/jenkinsci/ec2-plugin/pull/102
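
            A minimal sketch of that approach (the shape is assumed from the commit message below; idleTerminationMinutes and the idleTimeout() call are stand-ins for the plugin's actual fields and methods, so see the PR for the real diff):

            import java.util.concurrent.TimeUnit;

            import hudson.plugins.ec2.EC2Computer;

            class RetentionSketch {
                void check(EC2Computer computer, int idleTerminationMinutes) {
                    // The fix: an offline computer (still launching; agent not yet
                    // connected) is never a candidate for idle shutdown.
                    if (computer.isIdle() && computer.isOnline()) {
                        long idleMillis = System.currentTimeMillis() - computer.getIdleStartMilliseconds();
                        if (TimeUnit.MILLISECONDS.toMinutes(idleMillis) >= idleTerminationMinutes) {
                            computer.getNode().idleTimeout(); // stop or terminate, per the node's settings
                        }
                    }
                }
            }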

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Craig Ringer
            Path:
            src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java
            http://jenkins-ci.org/commit/ec2-plugin/d0cd0db5b845c747607a60759eebb7d31e4ea32b
            Log:
            Prevent nodes that are still starting up from being idle-stopped

            Per JENKINS-23792, the EC2 plugin will shut down nodes that're still starting
            up if the idle timeout is shorter than the time the node takes to go from
            launch request to successfully starting its first job on an executor.

            To prevent this, don't perform idle shutdown on a node that is marked offline.
            When it comes online, executors will be created and the new idle time will
            become the executor creation time, effectively resetting the timer.


              People

              • Assignee: francisu Francis Upton
              • Reporter: ringerc Craig Ringer
              • Votes: 0
              • Watchers: 3
