  1. Jenkins
  2. JENKINS-61603

Handle Spot Instance Interruption


      According to the Spot Instance documentation, AWS notifies the instance (best effort) approximately 2 minutes before it terminates a spot instance.

      Currently, when a spot instance is terminated, any builds executing on it are simply interrupted, which fails the build and forces us to restart it. Additionally, the slave's executors remain online during the 2-minute warning period, so they can still accept new builds even though the instance is about to be terminated.

      By monitoring the instance-action metadata (or CloudWatch events), we can receive notice that AWS is about to terminate a spot instance, and let the master react by taking the remaining executors offline (i.e., similar to the "Mark this node temporarily offline" button in the node status screen). That will do two things:

      1. Give visual notice in the Jenkins UI that a slave is intentionally going offline
      2. Prevent any additional jobs from being scheduled on that slave, allowing the built-in scheduling to route them to another online host, or possibly bring up a new instance to take its place
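
A minimal sketch of the polling side, assuming the IMDSv1-style metadata path documented by AWS (`/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled and then a small JSON body with `action` and `time`). The class and method names here (`SpotInterruptionPoller`, `parse`, `pollOnce`) are hypothetical and not part of the EC2 plugin. Note that the metadata endpoint is only reachable from inside the instance itself, so code like this would presumably run on the slave and report back to the master. Instances that enforce IMDSv2 would also need a session token header, which this sketch omits.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not an existing plugin class. */
class SpotInterruptionPoller {
    // Link-local metadata path for spot interruption notices (IMDSv1 style).
    static final URI INSTANCE_ACTION =
            URI.create("http://169.254.169.254/latest/meta-data/spot/instance-action");

    // The body is a tiny flat JSON object, so two regexes are enough for a sketch.
    private static final Pattern ACTION = Pattern.compile("\"action\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern TIME = Pattern.compile("\"time\"\\s*:\\s*\"([^\"]+)\"");

    /** The announced action (e.g. "terminate" or "stop") and its deadline. */
    static final class Notice {
        final String action;
        final Instant deadline;

        Notice(String action, Instant deadline) {
            this.action = action;
            this.deadline = deadline;
        }
    }

    /** Returns a notice only for a 200 response that carries both fields. */
    static Optional<Notice> parse(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return Optional.empty(); // 404 means no interruption is scheduled
        }
        Matcher action = ACTION.matcher(body);
        Matcher time = TIME.matcher(body);
        if (!action.find() || !time.find()) {
            return Optional.empty();
        }
        return Optional.of(new Notice(action.group(1), Instant.parse(time.group(1))));
    }

    /** Performs one poll; the caller would repeat this on the configured interval. */
    static Optional<Notice> pollOnce(HttpClient client) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(INSTANCE_ACTION)
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return parse(response.statusCode(), response.body());
    }
}
```

Keeping `parse` separate from the HTTP call makes the response handling checkable without a running EC2 instance.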

      We can then add configuration options to the SlaveTemplate to forcefully abort (setting status=Result.ABORTED) an executing job when we get notified that the slave will be terminated, so that the build status can reflect what actually happened.
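
One way the master-side reaction could be structured, as a rough sketch only. `InterruptionHandler`, its `Action` enum, and the `NodeControl` interface are hypothetical names standing in for the real Jenkins objects; in an actual plugin the two callbacks would presumably map onto existing core calls such as `Computer.setTemporarilyOffline(...)` and `Executor.interrupt(Result.ABORTED)`.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical dispatcher for illustration; not an existing plugin class. */
class InterruptionHandler {
    /** Mirrors the actions proposed in this issue. */
    enum Action { DO_NOTHING, TAKE_OFFLINE, ABORT_BUILDS }

    /** Hypothetical stand-in for the slave being interrupted. */
    interface NodeControl {
        void takeOffline(String reason);
        void abortRunningBuilds();
    }

    private final Set<Action> actions;

    InterruptionHandler(Set<Action> actions) {
        // An empty selection is treated as the proposed default, "Do nothing".
        this.actions = actions.isEmpty()
                ? EnumSet.of(Action.DO_NOTHING)
                : EnumSet.copyOf(actions);
    }

    /** Applies the configured actions once a notice arrives. */
    void onNotice(NodeControl node, String awsAction) {
        if (actions.contains(Action.TAKE_OFFLINE)) {
            // Stop new builds from landing here before touching running ones.
            node.takeOffline("Spot instance interruption notice: " + awsAction);
        }
        if (actions.contains(Action.ABORT_BUILDS)) {
            node.abortRunningBuilds();
        }
    }
}
```

Taking the node offline before aborting is a design choice in this sketch: it avoids a window in which an aborted build's executor could pick up new work.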

      Some ideas for configuration options and jelly template:

      [  ] Monitor for spot instance interruption notifications  [default=false, making it opt-in]
              (?) help links to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
      <block  when checkbox is checked>
          Polling Interval (in seconds):  [Default = 5, based on documented recommendation]
          On Terminate / Stop:   [select multiple actions, default=Do Nothing]
          <!-- Possible actions include:  Do nothing;  Take slave offline;  Abort builds; ...? -->
          <!-- Abort build options: -->
                  Abort Builds
                      When to abort:  [Immediately;  N seconds after notice;  N seconds before termination deadline]
                      <!-- Could be handled similar to "Idle Timeout", for example:
                              "0 => immediately"
                              "15 => 15 seconds after notice"
                              "-15 => 15 seconds before termination deadline"
                        -->
      </block>
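
The signed "When to abort" setting proposed above could be resolved to a concrete time roughly as follows. This is a sketch under assumptions: `AbortSchedule` and `abortAt` are hypothetical names, and clamping the result to the window between the notice and the deadline is an added safeguard that the proposal does not specify.

```java
import java.time.Instant;

/** Hypothetical helper for illustration; not an existing plugin class. */
class AbortSchedule {
    /**
     * Resolves the signed offset to an abort time:
     *   0        => immediately (at the time of the notice)
     *   positive => that many seconds after the notice
     *   negative => that many seconds before the termination deadline
     */
    static Instant abortAt(int offsetSeconds, Instant noticeReceived, Instant deadline) {
        Instant target = offsetSeconds >= 0
                ? noticeReceived.plusSeconds(offsetSeconds)
                : deadline.plusSeconds(offsetSeconds); // negative offset moves earlier

        // Assumption: never schedule before the notice or after the deadline.
        if (target.isBefore(noticeReceived)) {
            return noticeReceived;
        }
        if (target.isAfter(deadline)) {
            return deadline;
        }
        return target;
    }
}
```

For a notice received at 12:00:00 with a deadline of 12:02:00, a setting of `15` resolves to 12:00:15 and `-15` resolves to 12:01:45.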
      

      I haven't yet looked into what the implementation would look like, but if I get a chance, I will look into it and see if I can get a PR together.

            thoulen FABRIZIO MANFREDI
            jhansche Joe Hansche