Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-30284

EC2 plugin too aggressive in timing in contacting new AWS instance over SSH

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Fixed
    • Component/s: ec2-plugin
    • Labels:
      None
    • Environment:
      EC2 Plugin v 1.29
      Jenkins 1.624
    • Similar Issues:

      Description

      In every 10 or so instance launches, I see cases where the EC2 plugin opens a SSH connection to the new instance before the time where I believe the corresponding private key has been put in place by the AWS infrastructure during the launch of the instance. As a result, we see errors in the node launch. The output for the launch is below.

      just before slave CentOS (i-f295e51b) gets launched ...
      executing pre-launch scripts ...
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is ready
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Connected via SSH.
      bootstrap()
      Getting keypair...
      Using key: jenkins
      dc:xx:xx:xx
      ----BEGIN RSA PRIVATE KEY----
      MIIEow<private key info>
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      ERROR: Publickey authentication failed.
      java.io.IOException: Publickey authentication failed.
      at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:315)
      at com.trilead.ssh2.Connection.authenticateWithPublicKey(Connection.java:467)
      at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:260)
      at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:91)
      at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:107)
      at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:238)
      at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: The connection is closed.
      at com.trilead.ssh2.auth.AuthenticationManager.deQueue(AuthenticationManager.java:63)
      at com.trilead.ssh2.auth.AuthenticationManager.getNextMessage(AuthenticationManager.java:86)
      at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:290)
      ... 10 more
      Caused by: java.io.IOException: Peer sent DISCONNECT message (reason code 2): Too many authentication failures for centos
      at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
      at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
      ... 1 more

        Attachments

          Issue Links

            Activity

            mkingsbury Mike Kingsbury created issue -
            Hide
            tenyo Tenyo Grozev added a comment -

            We are running into the same problem with a CentOS slave (not sure if the OS matters) and it seems to be approximately 1 in 10 or so. Our output looks the same as the one from the description above.
            When that happens the affected slave just says "Ping response time is too long or timed out." and hangs around until we delete it.

            Show
            tenyo Tenyo Grozev added a comment - We are running into the same problem with a CentOS slave (not sure if the OS matters) and it seems to be approximately 1 in 10 or so. Our output looks the same as the one from the description above. When that happens the affected slave just says "Ping response time is too long or timed out." and hangs around until we delete it.
            Hide
            fishnix E Camden Fisher added a comment -

            Same issue here, running CentOS 7 slaves. Output looks the same as above please make these timing configurable! Thanks

            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
            Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is ready
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Failed to connect via ssh: The kexTimeout (10000 ms) expired.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
            Connected via SSH.
            bootstrap()
            Getting keypair...
            Using key: np-dev-key
            92:xx:yy:zz
            -----BEGIN RSA PRIVATE KEY-----
            MIIEoXXXX<snip>
            Authenticating as centos
            Authentication failed. Trying again...
            Authenticating as centos
            Authentication failed. Trying again...
            Authenticating as centos
            Authentication failed. Trying again...
            Authenticating as centos
            Authentication failed. Trying again...
            Authenticating as centos
            Authentication failed. Trying again...
            Authenticating as centos
            ERROR: Publickey authentication failed.
            java.io.IOException: Publickey authentication failed.
            	at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:315)
            	at com.trilead.ssh2.Connection.authenticateWithPublicKey(Connection.java:467)
            	at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:260)
            	at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:91)
            	at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:107)
            	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:253)
            	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            	at java.lang.Thread.run(Thread.java:745)
            Caused by: java.io.IOException: The connection is closed.
            	at com.trilead.ssh2.auth.AuthenticationManager.deQueue(AuthenticationManager.java:63)
            	at com.trilead.ssh2.auth.AuthenticationManager.getNextMessage(AuthenticationManager.java:86)
            	at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:290)
            	... 10 more
            Caused by: java.io.IOException: Peer sent DISCONNECT message (reason code 2): Too many authentication failures for centos
            	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
            	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
            	... 1 more
            
            Show
            fishnix E Camden Fisher added a comment - Same issue here, running CentOS 7 slaves. Output looks the same as above please make these timing configurable! Thanks Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is ready Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22 Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22 Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22 Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22 Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22 Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Failed to connect via ssh: The kexTimeout (10000 ms) expired. Waiting for SSH to come up. Sleeping 5. Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000. Connected via SSH. bootstrap() Getting keypair... Using key: np-dev-key 92:xx:yy:zz -----BEGIN RSA PRIVATE KEY----- MIIEoXXXX<snip> Authenticating as centos Authentication failed. Trying again... Authenticating as centos Authentication failed. Trying again... Authenticating as centos Authentication failed. Trying again... Authenticating as centos Authentication failed. Trying again... Authenticating as centos Authentication failed. Trying again... Authenticating as centos ERROR: Publickey authentication failed. java.io.IOException: Publickey authentication failed. at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:315) at com.trilead.ssh2.Connection.authenticateWithPublicKey(Connection.java:467) at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:260) at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:91) at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:107) at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:253) at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) Caused by: java.io.IOException: The connection is closed. at com.trilead.ssh2.auth.AuthenticationManager.deQueue(AuthenticationManager.java:63) at com.trilead.ssh2.auth.AuthenticationManager.getNextMessage(AuthenticationManager.java:86) at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:290) ... 10 more Caused by: java.io.IOException: Peer sent DISCONNECT message (reason code 2): Too many authentication failures for centos at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766) at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489) ... 1 more
            Hide
            fishnix E Camden Fisher added a comment -

            This seems to be getting worse - maybe extra load on AWS. Even having the ability to tweak these timeouts would be very helpful. I'm manually killing off 10 - 15 slaves a day now.

            Show
            fishnix E Camden Fisher added a comment - This seems to be getting worse - maybe extra load on AWS. Even having the ability to tweak these timeouts would be very helpful. I'm manually killing off 10 - 15 slaves a day now.
            Hide
            francisu Francis Upton added a comment -

            Currently the timeout for this is set at 10000ms (10 seconds). Should the default be more? We can certainly put a mechanism in to make this configurable.

            Show
            francisu Francis Upton added a comment - Currently the timeout for this is set at 10000ms (10 seconds). Should the default be more? We can certainly put a mechanism in to make this configurable.
            Hide
            fishnix E Camden Fisher added a comment -

            Looking more closely at this, I'm not 100% sure it's related to a "timeout", but maybe a chicken and egg situation where AWS hasn't dropped the SSH key yet. Would it be possible to retry that SSH some configurable number of times if it fails?

            Show
            fishnix E Camden Fisher added a comment - Looking more closely at this, I'm not 100% sure it's related to a "timeout", but maybe a chicken and egg situation where AWS hasn't dropped the SSH key yet. Would it be possible to retry that SSH some configurable number of times if it fails?
            Hide
            francisu Francis Upton added a comment - - edited

            Below is our code. And it looks like the problem is we are retrying too frequently and have 5 failures and centos does not like this. But I'm not sure what we are waiting for to change on the slave so that the authentication will eventually succeed. I mean 10 seconds between each try seems like a pretty long time.

            Any ideas of what it's waiting for?

            private int bootstrap(Connection bootstrapConn, EC2Computer computer, PrintStream logger) throws IOException,
                        InterruptedException, AmazonClientException {
                    logger.println("bootstrap()");
                    boolean closeBootstrap = true;
                    try {
                        int tries = 20;
                        boolean isAuthenticated = false;
                        logger.println("Getting keypair...");
                        KeyPair key = computer.getCloud().getKeyPair();
                        logger.println("Using key: " + key.getKeyName() + "\n" + key.getKeyFingerprint() + "\n"
                                + key.getKeyMaterial().substring(0, 160));
                        while (tries-- > 0) {
                            logger.println("Authenticating as " + computer.getRemoteAdmin());
                            isAuthenticated = bootstrapConn.authenticateWithPublicKey(computer.getRemoteAdmin(), key.getKeyMaterial().toCharArray(), "");
                            if (isAuthenticated) {
                                break;
                            }
                            logger.println("Authentication failed. Trying again...");
                            Thread.sleep(10000);
                        }
                        if (!isAuthenticated) {
                            logger.println("Authentication failed");
                            return FAILED;
                        }
                        closeBootstrap = false;
                        return SAMEUSER;
                    } finally {
                        if (closeBootstrap)
                            bootstrapConn.close();
                    }
                }
            
            Show
            francisu Francis Upton added a comment - - edited Below is our code. And it looks like the problem is we are retrying too frequently and have 5 failures and centos does not like this. But I'm not sure what we are waiting for to change on the slave so that the authentication will eventually succeed. I mean 10 seconds between each try seems like a pretty long time. Any ideas of what it's waiting for? private int bootstrap(Connection bootstrapConn, EC2Computer computer, PrintStream logger) throws IOException, InterruptedException, AmazonClientException { logger.println( "bootstrap()" ); boolean closeBootstrap = true ; try { int tries = 20; boolean isAuthenticated = false ; logger.println( "Getting keypair..." ); KeyPair key = computer.getCloud().getKeyPair(); logger.println( "Using key: " + key.getKeyName() + "\n" + key.getKeyFingerprint() + "\n" + key.getKeyMaterial().substring(0, 160)); while (tries-- > 0) { logger.println( "Authenticating as " + computer.getRemoteAdmin()); isAuthenticated = bootstrapConn.authenticateWithPublicKey(computer.getRemoteAdmin(), key.getKeyMaterial().toCharArray(), ""); if (isAuthenticated) { break ; } logger.println( "Authentication failed. Trying again..." ); Thread .sleep(10000); } if (!isAuthenticated) { logger.println( "Authentication failed" ); return FAILED; } closeBootstrap = false ; return SAMEUSER; } finally { if (closeBootstrap) bootstrapConn.close(); } }
            Hide
            fishnix E Camden Fisher added a comment - - edited

            I'm not totally sure, but it looks like I see OpenSSH start

            Jan  7 19:39:14 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
            

            and then get restarted later by cloud-init when the authorized_keys file is put in place.

            Jan  7 19:39:31 ip-10-1-1-219 systemd: Started Initial cloud-init job (metadata service crawler).
            Jan  7 19:39:31 ip-10-1-1-219 systemd: Starting Cloud-config availability.
            Jan  7 19:39:31 ip-10-1-1-219 systemd: Reached target Cloud-config availability.
            Jan  7 19:39:31 ip-10-1-1-219 systemd: Starting Apply the settings specified in cloud-config...
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon...
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation.
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Starting OpenSSH server daemon...
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Started Apply the settings specified in cloud-config.
            Jan  7 19:39:32 ip-10-1-1-219 systemd: Starting Execute cloud user/final scripts...
            Jan  7 19:39:32 ip-10-1-1-219 ec2:
            Jan  7 19:39:32 ip-10-1-1-219 ec2: #############################################################
            Jan  7 19:39:32 ip-10-1-1-219 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
            Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (ECDSA)
            Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (ED25519)
            Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (RSA)
            Jan  7 19:39:32 ip-10-1-1-219 ec2: -----END SSH HOST KEY FINGERPRINTS-----
            Jan  7 19:39:32 ip-10-1-1-219 ec2: #############################################################
            Jan  7 19:39:33 ip-10-1-1-219 systemd: Started Execute cloud user/final scripts.
            

            Maybe you are starting your authentication attempts during that first phase of SSH being up which eventually times out or centos says 'go away'.

            EDIT: the above logs are from a slave that succeeded by the way, so those times could be much different in the failure case.

            Show
            fishnix E Camden Fisher added a comment - - edited I'm not totally sure, but it looks like I see OpenSSH start Jan 7 19:39:14 ip-10-1-1-219 systemd: Started OpenSSH server daemon. and then get restarted later by cloud-init when the authorized_keys file is put in place. Jan 7 19:39:31 ip-10-1-1-219 systemd: Started Initial cloud-init job (metadata service crawler). Jan 7 19:39:31 ip-10-1-1-219 systemd: Starting Cloud-config availability. Jan 7 19:39:31 ip-10-1-1-219 systemd: Reached target Cloud-config availability. Jan 7 19:39:31 ip-10-1-1-219 systemd: Starting Apply the settings specified in cloud-config... Jan 7 19:39:32 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon... Jan 7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation. Jan 7 19:39:32 ip-10-1-1-219 systemd: Starting OpenSSH server daemon... Jan 7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH server daemon. Jan 7 19:39:32 ip-10-1-1-219 systemd: Started Apply the settings specified in cloud-config. Jan 7 19:39:32 ip-10-1-1-219 systemd: Starting Execute cloud user/final scripts... Jan 7 19:39:32 ip-10-1-1-219 ec2: Jan 7 19:39:32 ip-10-1-1-219 ec2: ############################################################# Jan 7 19:39:32 ip-10-1-1-219 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS----- Jan 7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx (ECDSA) Jan 7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx (ED25519) Jan 7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx (RSA) Jan 7 19:39:32 ip-10-1-1-219 ec2: -----END SSH HOST KEY FINGERPRINTS----- Jan 7 19:39:32 ip-10-1-1-219 ec2: ############################################################# Jan 7 19:39:33 ip-10-1-1-219 systemd: Started Execute cloud user/final scripts. Maybe you are starting your authentication attempts during that first phase of SSH being up which eventually times out or centos says 'go away'. EDIT: the above logs are from a slave that succeeded by the way, so those times could be much different in the failure case.
            Hide
            francisu Francis Upton added a comment -

            Thanks, that's helpful. Any way of getting the logs on both sides from the failure case (of the same failure)?

            Show
            francisu Francis Upton added a comment - Thanks, that's helpful. Any way of getting the logs on both sides from the failure case (of the same failure)?
            Hide
            fishnix E Camden Fisher added a comment -

            Sure, I'll collect logs on the next failure. Unfortunately, it looks like the slave.log's don't have timestamps.

            Show
            fishnix E Camden Fisher added a comment - Sure, I'll collect logs on the next failure. Unfortunately, it looks like the slave.log's don't have timestamps.
            Hide
            fishnix E Camden Fisher added a comment - - edited

            So it's a little tricky to wade through the log output, but this (I think) illustrates my thought (from a failed slave):

            Initial OpenSSH startup on slave:

            Jan  7 20:33:08 ip-10-1-1-219 systemd: Starting OpenSSH server daemon...
            Jan  7 20:33:09 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
            

            Slave failure in jenkins.log:

            Jan 07, 2016 8:34:31 PM hudson.slaves.NodeProvisioner$2 run
            WARNING: Provisioned slave Jenkins Docker Slave (ami-8ad0a2e0) failed to launch
            java.util.concurrent.ExecutionException: java.io.IOException: Slave failed to connect, even though the launcher didn't report it. See the log output for details.
            	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
            	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
            	at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:409)
            	at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:394)
            	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            	at java.lang.Thread.run(Thread.java:745)
            Caused by: java.io.IOException: Slave failed to connect, even though the launcher didn't report it. See the log output for details.
            	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:277)
            	... 5 more
            
            Jan 07, 2016 8:34:41 PM hudson.plugins.ec2.EC2Cloud provision
            

            Cloud-init on slave:

            Jan  7 20:34:47 ip-10-1-1-219 systemd: Started Initial cloud-init job (metadata service crawler).
            Jan  7 20:34:47 ip-10-1-1-219 systemd: Starting Cloud-config availability.
            Jan  7 20:34:47 ip-10-1-1-219 systemd: Reached target Cloud-config availability.
            Jan  7 20:34:47 ip-10-1-1-219 systemd: Starting Apply the settings specified in cloud-config...
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon...
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation.
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Starting OpenSSH server daemon...
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Started Apply the settings specified in cloud-config.
            Jan  7 20:34:51 ip-10-1-1-219 systemd: Starting Execute cloud user/final scripts...
            Jan  7 20:34:52 ip-10-1-1-219 ec2: 
            Jan  7 20:34:52 ip-10-1-1-219 ec2: #############################################################
            Jan  7 20:34:52 ip-10-1-1-219 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
            Jan  7 20:34:53 ip-10-1-1-219 ec2: 256 xxxxxxxx   (ECDSA)
            Jan  7 20:34:53 ip-10-1-1-219 ec2: 256 xxxxxxxx   (ED25519)
            Jan  7 20:34:53 ip-10-1-1-219 ec2: 2048 xxxxxxxx   (RSA)
            Jan  7 20:34:53 ip-10-1-1-219 ec2: -----END SSH HOST KEY FINGERPRINTS-----
            Jan  7 20:34:53 ip-10-1-1-219 ec2: #############################################################
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Started Execute cloud user/final scripts.
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Starting Multi-User System.
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Reached target Multi-User System.
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Starting Update UTMP about System Runlevel Changes...
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Started Update UTMP about System Runlevel Changes.
            Jan  7 20:34:53 ip-10-1-1-219 systemd: Startup finished in 1.528s (kernel) + 35.815s (initrd) + 3min 18.160s (userspace) = 3min 55.504s.
            
            Show
            fishnix E Camden Fisher added a comment - - edited So it's a little tricky to wade through the log output, but this (I think) illustrates my thought (from a failed slave): Initial OpenSSH startup on slave: Jan 7 20:33:08 ip-10-1-1-219 systemd: Starting OpenSSH server daemon... Jan 7 20:33:09 ip-10-1-1-219 systemd: Started OpenSSH server daemon. Slave failure in jenkins.log: Jan 07, 2016 8:34:31 PM hudson.slaves.NodeProvisioner$2 run WARNING: Provisioned slave Jenkins Docker Slave (ami-8ad0a2e0) failed to launch java.util.concurrent.ExecutionException: java.io.IOException: Slave failed to connect, even though the launcher didn't report it. See the log output for details. at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:409) at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:394) at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Slave failed to connect, even though the launcher didn't report it. See the log output for details. at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:277) ... 5 more Jan 07, 2016 8:34:41 PM hudson.plugins.ec2.EC2Cloud provision Cloud-init on slave: Jan 7 20:34:47 ip-10-1-1-219 systemd: Started Initial cloud-init job (metadata service crawler). Jan 7 20:34:47 ip-10-1-1-219 systemd: Starting Cloud-config availability. Jan 7 20:34:47 ip-10-1-1-219 systemd: Reached target Cloud-config availability. Jan 7 20:34:47 ip-10-1-1-219 systemd: Starting Apply the settings specified in cloud-config... Jan 7 20:34:51 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon... Jan 7 20:34:51 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation. Jan 7 20:34:51 ip-10-1-1-219 systemd: Starting OpenSSH server daemon... Jan 7 20:34:51 ip-10-1-1-219 systemd: Started OpenSSH server daemon. Jan 7 20:34:51 ip-10-1-1-219 systemd: Started Apply the settings specified in cloud-config. Jan 7 20:34:51 ip-10-1-1-219 systemd: Starting Execute cloud user/final scripts... Jan 7 20:34:52 ip-10-1-1-219 ec2: Jan 7 20:34:52 ip-10-1-1-219 ec2: ############################################################# Jan 7 20:34:52 ip-10-1-1-219 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS----- Jan 7 20:34:53 ip-10-1-1-219 ec2: 256 xxxxxxxx (ECDSA) Jan 7 20:34:53 ip-10-1-1-219 ec2: 256 xxxxxxxx (ED25519) Jan 7 20:34:53 ip-10-1-1-219 ec2: 2048 xxxxxxxx (RSA) Jan 7 20:34:53 ip-10-1-1-219 ec2: -----END SSH HOST KEY FINGERPRINTS----- Jan 7 20:34:53 ip-10-1-1-219 ec2: ############################################################# Jan 7 20:34:53 ip-10-1-1-219 systemd: Started Execute cloud user/final scripts. Jan 7 20:34:53 ip-10-1-1-219 systemd: Starting Multi-User System. Jan 7 20:34:53 ip-10-1-1-219 systemd: Reached target Multi-User System. Jan 7 20:34:53 ip-10-1-1-219 systemd: Starting Update UTMP about System Runlevel Changes... Jan 7 20:34:53 ip-10-1-1-219 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup. Jan 7 20:34:53 ip-10-1-1-219 systemd: Started Update UTMP about System Runlevel Changes. Jan 7 20:34:53 ip-10-1-1-219 systemd: Startup finished in 1.528s (kernel) + 35.815s (initrd) + 3min 18.160s (userspace) = 3min 55.504s.
            francisu Francis Upton made changes -
            Field Original Value New Value
            Link This issue is duplicated by JENKINS-21891 [ JENKINS-21891 ]
            Hide
            fishnix E Camden Fisher added a comment -

            Hi -
            Sorry to be a pest, but is there any forward motion here? I'm going crazy killing dead slaves.... Thanks!
            -Camden

            Show
            fishnix E Camden Fisher added a comment - Hi - Sorry to be a pest, but is there any forward motion here? I'm going crazy killing dead slaves.... Thanks! -Camden
            Hide
            francisu Francis Upton added a comment -

            Hey Camden,
            Would you be OK if I gave you a couple of system properties that you can set to control the time to wait and the number of times to loop? And also I fixed the logging so that it has timestamps on it? I can put this in today and you can take a snapshot build (let me know if you need help to get the hpi file) and try it out to see if it fixes your problem before I release it. If it's truly the timeout value that sometimes needs to be different depending on the OS or such, then I can move it form a system property to something in the UI, but let's confirm that's the case first.
            Let me know.
            Francis

            Show
            francisu Francis Upton added a comment - Hey Camden, Would you be OK if I gave you a couple of system properties that you can set to control the time to wait and the number of times to loop? And also I fixed the logging so that it has timestamps on it? I can put this in today and you can take a snapshot build (let me know if you need help to get the hpi file) and try it out to see if it fixes your problem before I release it. If it's truly the timeout value that sometimes needs to be different depending on the OS or such, then I can move it form a system property to something in the UI, but let's confirm that's the case first. Let me know. Francis
            Hide
            fishnix E Camden Fisher added a comment -

            Yeah that works for me. Let me know how to find the hpi. Thanks!
            -Camden

            Show
            fishnix E Camden Fisher added a comment - Yeah that works for me. Let me know how to find the hpi. Thanks! -Camden
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Francis Upton IV
            Path:
            src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java
            src/main/java/hudson/plugins/ec2/EC2Cloud.java
            src/main/java/hudson/plugins/ec2/EC2Computer.java
            src/main/java/hudson/plugins/ec2/EC2ComputerLauncher.java
            src/main/java/hudson/plugins/ec2/ssh/EC2UnixLauncher.java
            http://jenkins-ci.org/commit/ec2-plugin/86f1defc1edb6c85f6e0a1ab99c969e875148f72
            Log:
            JENKINS-32439 Incorrect slave template (AMI) found when launching slave
            JENKINS-30284 EC2 plugin too aggressive in timing in contacting new AWS instance over SSH (parameterized timeouts, added logging)

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Francis Upton IV Path: src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java src/main/java/hudson/plugins/ec2/EC2Cloud.java src/main/java/hudson/plugins/ec2/EC2Computer.java src/main/java/hudson/plugins/ec2/EC2ComputerLauncher.java src/main/java/hudson/plugins/ec2/ssh/EC2UnixLauncher.java http://jenkins-ci.org/commit/ec2-plugin/86f1defc1edb6c85f6e0a1ab99c969e875148f72 Log: JENKINS-32439 Incorrect slave template (AMI) found when launching slave JENKINS-30284 EC2 plugin too aggressive in timing in contacting new AWS instance over SSH (parameterized timeouts, added logging)
            Hide
            francisu Francis Upton added a comment -

            You can get the hpi here.

            https://jenkins.ci.cloudbees.com/job/plugins/job/ec2-plugin/lastBuild/org.jenkins-ci.plugins$ec2/artifact/org.jenkins-ci.plugins/ec2/1.30-SNAPSHOT/ec2-1.30-SNAPSHOT.hpi

            The properties are (with their default values [these were the existing values])

            private static final String BOOTSTRAP_AUTH_SLEEP_MS = "jenkins.ec2.bootstrapAuthSleepMs";
            private static final String BOOTSTRAP_AUTH_TRIES= "jenkins.ec2.bootstrapAuthTries";

            private static int bootstrapAuthSleepMs = 10000;
            private static int bootstrapAuthTries = 20;

            Show
            francisu Francis Upton added a comment - You can get the hpi here. https://jenkins.ci.cloudbees.com/job/plugins/job/ec2-plugin/lastBuild/org.jenkins-ci.plugins$ec2/artifact/org.jenkins-ci.plugins/ec2/1.30-SNAPSHOT/ec2-1.30-SNAPSHOT.hpi The properties are (with their default values [these were the existing values] ) private static final String BOOTSTRAP_AUTH_SLEEP_MS = "jenkins.ec2.bootstrapAuthSleepMs"; private static final String BOOTSTRAP_AUTH_TRIES= "jenkins.ec2.bootstrapAuthTries"; private static int bootstrapAuthSleepMs = 10000; private static int bootstrapAuthTries = 20;
            Hide
            fishnix E Camden Fisher added a comment -

            Thanks Francis - we are now running the snapshot. I'll let you know how it goes!

            Show
            fishnix E Camden Fisher added a comment - Thanks Francis - we are now running the snapshot. I'll let you know how it goes!
            Hide
            fishnix E Camden Fisher added a comment -

            Hi Francis - Just a quick update. Bumping the sleep to 30000 and the tries to 30 has completely solved my problem so far. Cheers!

            Show
            fishnix E Camden Fisher added a comment - Hi Francis - Just a quick update. Bumping the sleep to 30000 and the tries to 30 has completely solved my problem so far. Cheers!
            Hide
            fishnix E Camden Fisher added a comment -

            Hi again -
            It's been a few days now and this has completely resolved the issues we were seeing. I would :+1: some UI around these options and allowing them to be set.
            Cheers,
            -Camden

            Show
            fishnix E Camden Fisher added a comment - Hi again - It's been a few days now and this has completely resolved the issues we were seeing. I would :+1: some UI around these options and allowing them to be set. Cheers, -Camden
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Francis Upton IV
            Path:
            src/main/java/hudson/plugins/ec2/ssh/EC2UnixLauncher.java
            http://jenkins-ci.org/commit/ec2-plugin/06c7369977c01520b1d89a3a868d779cdd2600c9
            Log:
            JENKINS-30284 EC2 plugin too aggressive in timing in contacting new AWS instance over SSH (changed default timeouts)

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Francis Upton IV Path: src/main/java/hudson/plugins/ec2/ssh/EC2UnixLauncher.java http://jenkins-ci.org/commit/ec2-plugin/06c7369977c01520b1d89a3a868d779cdd2600c9 Log: JENKINS-30284 EC2 plugin too aggressive in timing in contacting new AWS instance over SSH (changed default timeouts)
            Hide
            francisu Francis Upton added a comment -

            I think I'm going to change the defaults to 30000/30 and leave the properties as they are. There is already a lot of obscure UI around various things and I think since this can be tweaked by the properties if necessary (and maybe it won't even be necessary with better defaults) I'm reluctant to expose it in the UI. If people really want then, I can do it, but let's wait and see on that. I will document the properties on the Wiki.

            Show
            francisu Francis Upton added a comment - I think I'm going to change the defaults to 30000/30 and leave the properties as they are. There is already a lot of obscure UI around various things and I think since this can be tweaked by the properties if necessary (and maybe it won't even be necessary with better defaults) I'm reluctant to expose it in the UI. If people really want then, I can do it, but let's wait and see on that. I will document the properties on the Wiki.
            Hide
            francisu Francis Upton added a comment -

            Released in 1.30

            Show
            francisu Francis Upton added a comment - Released in 1.30
            francisu Francis Upton made changes -
            Status Open [ 1 ] Closed [ 6 ]
            Resolution Fixed [ 1 ]
            francisu Francis Upton made changes -
            Link This issue is duplicated by JENKINS-24691 [ JENKINS-24691 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 165378 ] JNJira + In-Review [ 209203 ]
            Hide
            lovelycuppatea Alan added a comment - - edited

            we're having an issue with the ssh client connecting before user data is done.

            how do we edit bootstrapAuthSleepMs?

             

            UPDATE: in case anyone else needs to know add -Djenkins.ec2.bootstrapAuthSleepMs=60000 to your jenkins launch command (java -jar jenkins.war -D...) to add a 60 second wait.

            Show
            lovelycuppatea Alan added a comment - - edited we're having an issue with the ssh client connecting before user data is done. how do we edit bootstrapAuthSleepMs?   UPDATE: in case anyone else needs to know add -Djenkins.ec2.bootstrapAuthSleepMs=60000 to your jenkins launch command (java -jar jenkins.war -D...) to add a 60 second wait.

              People

              • Assignee:
                francisu Francis Upton
                Reporter:
                mkingsbury Mike Kingsbury
              • Votes:
                2 Vote for this issue
                Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: