Jenkins / JENKINS-53810

Launching agents fails with ERROR: null java.util.concurrent.CancellationException

    Details

    • Type: Bug
    • Status: Fixed but Unreleased
    • Priority: Major
    • Resolution: Fixed

      Description

      Launching node/agent fails with

      ERROR: null
      java.util.concurrent.CancellationException

      We have a large number of jobs in the queue which get assigned to slaves created by the Docker plugin. Even if we create a slave manually and try to launch the agent, it fails.
      Note: the slave image adheres to all the requirements and works well when there is no huge queue.

      Executor Status
      SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
      [09/27/18 02:53:32] [SSH] Opening SSH connection to 9.47.78.144:32870.
      [09/27/18 02:53:32] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
      [09/27/18 02:53:32] [SSH] Authentication successful.
      [09/27/18 02:53:32] [SSH] The remote user's environment is:
      BASH=/usr/bin/bash
      BASHOPTS=cmdhist:expand_aliases:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
      BASH_ALIASES=()
      BASH_ARGC=()
      BASH_ARGV=()
      BASH_CMDS=()
      BASH_EXECUTION_STRING=set
      BASH_LINENO=()
      BASH_SOURCE=()
      BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="1" [4]="release" [5]="s390x-ibm-linux-gnu")
      BASH_VERSION='4.2.46(1)-release'
      DIRSTACK=()
      EUID=1000
      GROUPS=()
      HOME=/home/test
      HOSTNAME=01695f4aae73
      HOSTTYPE=s390x
      IFS=$' \t\n'
      LESSOPEN='||/usr/bin/lesspipe.sh %s'
      LOGNAME=test
      MACHTYPE=s390x-ibm-linux-gnu
      MAIL=/var/mail/test
      OPTERR=1
      OPTIND=1
      OSTYPE=linux-gnu
      PATH=/usr/local/bin:/usr/bin
      PIPESTATUS=([0]="0")
      PPID=13
      PS4='+ '
      PWD=/home/test
      SHELL=/bin/bash
      SHELLOPTS=braceexpand:hashall:interactive-comments
      SHLVL=1
      SSH_CLIENT='9.42.27.56 44378 22'
      SSH_CONNECTION='9.42.27.56 44378 172.17.0.2 22'
      TERM=dumb
      UID=1000
      USER=test
      _=sudo
      [09/27/18 02:53:32] [SSH] Checking java version of /home/test/jdk/bin/java
      Couldn't figure out the Java version of /home/test/jdk/bin/java
      bash: /home/test/jdk/bin/java: No such file or directory
      
      [09/27/18 02:53:33] [SSH] Checking java version of java
      [09/27/18 02:53:34] [SSH] java -version returned 1.8.0_151.
      [09/27/18 02:53:34] [SSH] Starting sftp client.
      [09/27/18 02:53:34] [SSH] Copying latest remoting.jar...
      [09/27/18 02:53:36] [SSH] Copied 776,265 bytes.
      Expanded the channel window size to 4MB
      [09/27/18 02:53:36] [SSH] Starting agent process: cd "/home/test" && java  -jar remoting.jar -workDir /home/test
      Sep 27, 2018 6:54:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/test/remoting as a remoting work directory
      Both error and output logs will be printed to /home/test/remoting
      ERROR: null
      java.util.concurrent.CancellationException
      	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:904)
      	at hudson.slaves.DelegatingComputerLauncher.launch(DelegatingComputerLauncher.java:64)
      	at io.jenkins.docker.connector.DockerComputerConnector$1.launch(DockerComputerConnector.java:117)
      	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      [09/27/18 02:57:02] Launch failed - cleaning up connection
      Slave JVM has not reported exit code. Is it still running?
      [09/27/18 02:57:02] [SSH] Connection closed.
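
      For context, the CancellationException above is what FutureTask.get() reports (see SSHLauncher.launch at the top of the stack trace) when the pending launch task is cancelled before it completes. A minimal, self-contained sketch of that pattern, with an assumed executor setup and shortened timings for illustration (this is not the plugin's actual code):

      import java.util.concurrent.*;

      public class LaunchTimeoutSketch {
          public static void main(String[] args) throws Exception {
              ExecutorService launcherExecutor = Executors.newSingleThreadExecutor();
              ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();

              // Submit the (slow) launch work, standing in for the SSH connect + remoting start.
              Future<Void> launch = launcherExecutor.submit(() -> {
                  Thread.sleep(5_000);              // pretend the agent takes a while to come up
                  return null;
              });

              // A watchdog cancels the task when the launch timeout elapses
              // (launchTimeoutSeconds=210 in the log above; 1 second here for the demo).
              watchdog.schedule(() -> { launch.cancel(true); }, 1, TimeUnit.SECONDS);

              try {
                  launch.get();                     // same get() call pattern as in the stack trace
              } catch (CancellationException e) {
                  // Printed as "ERROR: null" because CancellationException carries no message.
                  e.printStackTrace();
              } finally {
                  launcherExecutor.shutdownNow();
                  watchdog.shutdownNow();
              }
          }
      }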
      
      
      


            Activity

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            AFAIU it uses the SSHLauncher:
            https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L457
            https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L264
            https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L359
            https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L407

            It has its own connector, but it is not the one from the ssh-slaves plugin, nor an extension of it or of ComputerConnector; I have not dug too deeply into the code.
            https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L61
            https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerConnector.java#L28

            In any case, I'm changing how the timeout works: we fixed the trilead-ssh2 lib to manage the timeout, so the current behavior is no longer valid. We have to manage the timeout per connection, not per block of retries.
            mzanetti Michael Zanetti added a comment -

            Ivan Fernandez Calvo

            In reply to your question on when exactly this happens for me:

            So, my setup is as follows: I use AWS EC2 instances which are started on demand using the "Launch Agents via SSH" method. I also have an additional plugin installed which allows me to execute commands before the SSH connection.

            In my custom script I use "aws ec2 launch-instance..." and then "aws ec2 instance-wait ..." to wait until it has booted up. This will stall the normal "Launch agents via SSH" procedure until slaves are up and running. After that, it continues with the actual SSH call.

            Now, this works fine in 90% of the cases; however, in some cases aws instance-wait returns before sshd on the slave is actually running, and the connection attempt fails with the message:

            [07/29/19 11:57:14] [SSH] Opening SSH connection to ec2-obfuscated.eu-west-1.compute.amazonaws.com:22. Connection refused (Connection refused) SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds. There are 10 more retries left.
            

            It waits for 15 seconds and tries again. The second attempt succeeds and all is fine; jobs start building.

             

            However, this is the situation where the bug strikes. The SSH connector plugin did not stop the connection timeout for that first, failed connection; it is still running, by default for another 3 minutes. The jobs start building, but when those 3 minutes have passed, the currently working SSH connection is killed because the ssh-connector thinks it is still trying to establish the first connection.

             

            Hope this makes it clear enough.
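
            To illustrate the failure mode described above: a minimal sketch in which a single launch-wide timeout, armed once per block of retries rather than per connection, outlives a failed first attempt and later kills a healthy connection. The FakeConnection class and all timings are hypothetical stand-ins, not the plugin's code:

            import java.io.IOException;
            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.atomic.AtomicReference;

            // Hypothetical stand-in for an SSH connection; attempt #1 always fails.
            class FakeConnection {
                private final int attempt;
                FakeConnection(int attempt) throws IOException {
                    if (attempt == 1) throw new IOException("Connection refused (Connection refused)");
                    this.attempt = attempt;
                }
                void close() { System.out.println("Stale timeout killed the connection from attempt " + attempt); }
            }

            public class GlobalTimeoutBug {
                public static void main(String[] args) throws Exception {
                    ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
                    AtomicReference<FakeConnection> current = new AtomicReference<>();

                    // BUG: the timeout is armed once for the whole launch (per block of retries),
                    // not per connection attempt, so it keeps ticking after attempt #1 fails.
                    watchdog.schedule(() -> {
                        FakeConnection c = current.get();
                        if (c != null) c.close();       // kills the healthy connection from attempt #2
                    }, 3, TimeUnit.SECONDS);            // 3 s here; launchTimeoutSeconds=210 in the log

                    for (int attempt = 1; attempt <= 10; attempt++) {      // maxNumRetries=10 in the log
                        try {
                            current.set(new FakeConnection(attempt));
                            System.out.println("Attempt " + attempt + " connected; agent is online, jobs start building.");
                            break;
                        } catch (IOException e) {
                            System.out.println("SSH Connection failed with IOException: \"" + e.getMessage()
                                    + "\", retrying in 15 seconds.");
                            Thread.sleep(1_000);        // shortened retryWaitTime for the demo
                        }
                    }

                    Thread.sleep(5_000);                // ...until the stale launch-wide timeout fires
                    watchdog.shutdownNow();
                }
            }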

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            Yep, it is really helpful. I'm fixing the retry method in https://issues.jenkins-ci.org/browse/JENKINS-58589, and I've found that the connection was not closed in some cases, so now it is forced to close before retrying. The timeout is also passed to the connection, so the weird global timeout disappears; now there is a timeout per connection.
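
            A rough sketch of what a per-connection timeout with close-before-retry could look like against the trilead-ssh2 Connection API; the retry loop and parameter names below are illustrative assumptions, not the actual fix:

            import java.io.IOException;
            import com.trilead.ssh2.Connection;

            public class PerConnectionTimeoutSketch {
                // Illustrative retry loop: each attempt gets its own timeout and the
                // connection is always closed before the next attempt.
                static Connection connectWithRetries(String host, int port,
                                                     int maxNumRetries, int retryWaitTimeSeconds,
                                                     int launchTimeoutSeconds) throws Exception {
                    IOException last = null;
                    for (int attempt = 0; attempt <= maxNumRetries; attempt++) {
                        Connection connection = new Connection(host, port);
                        try {
                            // The timeout is passed to the connection itself (connect timeout and
                            // key-exchange timeout, in milliseconds), so there is no launch-wide
                            // watchdog that can outlive a failed attempt.
                            connection.connect(null, launchTimeoutSeconds * 1000, launchTimeoutSeconds * 1000);
                            return connection;
                        } catch (IOException e) {
                            last = e;
                            connection.close();   // always close before retrying
                            Thread.sleep(retryWaitTimeSeconds * 1000L);
                        }
                    }
                    throw last;
                }
            }

            Closing the connection inside the catch block is what prevents a half-open first attempt from lingering alongside a later, successful one.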

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            I am pretty sure this is related to JENKINS-59764. I have made the PR to the docker-plugin with the solution; it'd need trilead-ssh2-build-217-jenkins-15 on Jenkins Core 2.190- or trilead-api 1.0.5 on Jenkins Core 2.190.1+.

            pjdarton pjdarton added a comment -

            I'd be a bit wary of updating the docker-plugin so it requires a fully up-to-date version of Jenkins.  I believe that users should be able to use the latest docker-plugin with a core Jenkins that's at least 6 months out of date.  If we had a critical security defect that required the latest Jenkins core then, sure, I guess we'd have to do that but for anything else, it'd be best not to force everyone to update.

            FYI what the docker-plugin is trying to do is to poll the docker container (that it just created) to wait until the container is ready to accept SSH connections before allowing the main SSHLauncher code to start trying to connect.  If the docker-plugin declared its provisioning "complete" and the slave stayed offline for any period then Jenkins would ask it to make more slave nodes than are required, causing resource wastage.
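
            For illustration, a minimal way to poll a host until its SSH port accepts TCP connections before handing over to the launcher (plain JDK sockets; a sketch of the idea, not the docker-plugin's implementation):

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            public class WaitForSshPort {
                /** Polls host:port until a TCP connection succeeds or the deadline passes. */
                static boolean waitForPort(String host, int port, long timeoutMillis) throws InterruptedException {
                    long deadline = System.currentTimeMillis() + timeoutMillis;
                    while (System.currentTimeMillis() < deadline) {
                        try (Socket socket = new Socket()) {
                            socket.connect(new InetSocketAddress(host, port), 2_000);
                            return true;                    // sshd is accepting connections
                        } catch (IOException refusedOrTimedOut) {
                            Thread.sleep(1_000);            // container not ready yet; poll again
                        }
                    }
                    return false;
                }

                public static void main(String[] args) throws InterruptedException {
                    // Example: the host and port from the log above.
                    System.out.println(waitForPort("9.47.78.144", 32870, 60_000));
                }
            }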


              People

              • Assignee: ifernandezcalvo Ivan Fernandez Calvo
              • Reporter: durgadas Durgadas Kamath
              • Votes: 4
              • Watchers: 14
