Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-57795

Orphaned EC2 instances after Jenkins restart

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: ec2-plugin
    • Labels:
      None
    • Environment:
      Jenkins ver. 2.176.1
      ec2 plugin 1.43, 1.44, 1.45
    • Similar Issues:

      Description

      Sometimes after a Jenkins restart the plugin won't be able to spawn more agents.

      The plugin will just loop on this:

      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Attempting to provision slave needed by excess workload of 1 units
      May 31, 2019 2:23:53 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Cannot provision - no capacity for instances: 0
      May 31, 2019 2:23:53 PM WARNING hudson.plugins.ec2.EC2Cloud provision
      Can't raise nodes for SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}
      

      If I go to the EC2 console and terminate the instance manually the plugin will spawn a new one and use it.

      It seems like there is some mismatch in the plugin logic. The part responsible for calculating the number of instances and checking the cap sees the EC2 instance. However the part responsible for picking up running EC2 instances doesn't seem to be able to find it.

      We use a single subnet, security group and vpc (I've seen some reports about this causing problems).

      We use instanceCap = 1 setting as we are testing the plugin, this might make this problem more visible than with a higher cap.

        Attachments

          Activity

          Hide
          jbochenski Jakub Bochenski added a comment -

          Is there a way to reproduce this consistently from your end? Also what is the number of instances that get run in your setup?

          It happens quite often after restart but I have no way to reproduce it 100%.

          Maybe the fact that the instance counting logic sees the EC2 machine (since it says there is no capacity), but the attaching node can't connect it for some reason would be a hint?

          lso what is the number of instances that get run in your setup?

          I'm not sure if I understand. The instance cap is 1 so we have at most 1 instance.

          Show
          jbochenski Jakub Bochenski added a comment - Is there a way to reproduce this consistently from your end? Also what is the number of instances that get run in your setup? It happens quite often after restart but I have no way to reproduce it 100%. Maybe the fact that the instance counting logic sees the EC2 machine (since it says there is no capacity), but the attaching node can't connect it for some reason would be a hint? lso what is the number of instances that get run in your setup? I'm not sure if I understand. The instance cap is 1 so we have at most 1 instance.
          Hide
          raihaan Raihaan Shouhell added a comment -

          From this log
          ```
          WARNING: SlaveTemplate

          {ami='ami-****', labels='build-yocto-persistent'}

          . Node stopped is neither pending, neither running, its {2}. Terminate provisioning
          ```
          It says that the node has been stopped. Btw are you on ondemand slaves or spots

          Show
          raihaan Raihaan Shouhell added a comment - From this log ``` WARNING: SlaveTemplate {ami='ami-****', labels='build-yocto-persistent'} . Node stopped is neither pending, neither running, its {2}. Terminate provisioning ``` It says that the node has been stopped. Btw are you on ondemand slaves or spots
          Hide
          sirzic cedric lecoz added a comment -

          Hi Raihaan Shouhell,
          That's what it says, but the EC2 instance was alive, I could ssh and work away, it's just that Jenkins was not aware of it. my instances are on demand.
          Attached jenkins_201909121030.log a log with a bit more data 2 differences EC2 had the issue (or similar) the first one (aws-audit-ec2) had an EC2 running and terminated an hour before (so it was still showing as terminated in my EC2 console). the second one the EC2 already existed, and was just stopped. I tried to clean the log at best, but I have too many jobs running on other ec2, it's noisy.
          BR,
          Cedric.

          Show
          sirzic cedric lecoz added a comment - Hi Raihaan Shouhell , That's what it says, but the EC2 instance was alive, I could ssh and work away, it's just that Jenkins was not aware of it. my instances are on demand. Attached jenkins_201909121030.log a log with a bit more data 2 differences EC2 had the issue (or similar) the first one (aws-audit-ec2) had an EC2 running and terminated an hour before (so it was still showing as terminated in my EC2 console). the second one the EC2 already existed, and was just stopped. I tried to clean the log at best, but I have too many jobs running on other ec2, it's noisy. BR, Cedric.
          Hide
          raihaan Raihaan Shouhell added a comment -

          cedric lecoz Jakub Bochenski could someone test   https://ci.jenkins.io/job/Plugins/job/ec2-plugin/job/PR-397/2/artifact/org/jenkins-ci/plugins/ec2/1.46-rc1050.a8a95e8dd7f5/ec2-1.46-rc1050.a8a95e8dd7f5.hpi and see if this issue still occurs? This retries on missing instances instead of giving up immediately.

          Show
          raihaan Raihaan Shouhell added a comment - cedric lecoz Jakub Bochenski could someone test   https://ci.jenkins.io/job/Plugins/job/ec2-plugin/job/PR-397/2/artifact/org/jenkins-ci/plugins/ec2/1.46-rc1050.a8a95e8dd7f5/ec2-1.46-rc1050.a8a95e8dd7f5.hpi and see if this issue still occurs? This retries on missing instances instead of giving up immediately.
          Hide
          sirzic cedric lecoz added a comment -

          HI Raihaan Shouhell,
          I updated to that new version this morning, few tests I did were ok. I added a job which will start / destroy / start ..an ec2 using the plugin every 15min, and asked the team to ping me if they see the problem happen. if we don't see the problem, will try to update this ticket by next Thursday.
          BR,
          Cedric.

          Show
          sirzic cedric lecoz added a comment - HI Raihaan Shouhell , I updated to that new version this morning, few tests I did were ok. I added a job which will start / destroy / start ..an ec2 using the plugin every 15min, and asked the team to ping me if they see the problem happen. if we don't see the problem, will try to update this ticket by next Thursday. BR, Cedric.

            People

            • Assignee:
              thoulen FABRIZIO MANFREDI
              Reporter:
              jbochenski Jakub Bochenski
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated: