JENKINS-53810

Launch Agents fails with ERROR: null java.util.concurrent.CancellationException


    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Labels:
    • Environment:

      Description

      Launching node/agent fails with

      ERROR: null
      java.util.concurrent.CancellationException

      We have a large number of jobs in the queue which get assigned to slaves created by the Docker plugin. Even if we try creating a slave and launching the agent manually, it fails.
      Note: the slave image adheres to all the requirements and works well when there is no huge queue.

      Executor Status
      SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
      [09/27/18 02:53:32] [SSH] Opening SSH connection to 9.47.78.144:32870.
      [09/27/18 02:53:32] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
      [09/27/18 02:53:32] [SSH] Authentication successful.
      [09/27/18 02:53:32] [SSH] The remote user's environment is:
      BASH=/usr/bin/bash
      BASHOPTS=cmdhist:expand_aliases:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
      BASH_ALIASES=()
      BASH_ARGC=()
      BASH_ARGV=()
      BASH_CMDS=()
      BASH_EXECUTION_STRING=set
      BASH_LINENO=()
      BASH_SOURCE=()
      BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="1" [4]="release" [5]="s390x-ibm-linux-gnu")
      BASH_VERSION='4.2.46(1)-release'
      DIRSTACK=()
      EUID=1000
      GROUPS=()
      HOME=/home/test
      HOSTNAME=01695f4aae73
      HOSTTYPE=s390x
      IFS=$' \t\n'
      LESSOPEN='||/usr/bin/lesspipe.sh %s'
      LOGNAME=test
      MACHTYPE=s390x-ibm-linux-gnu
      MAIL=/var/mail/test
      OPTERR=1
      OPTIND=1
      OSTYPE=linux-gnu
      PATH=/usr/local/bin:/usr/bin
      PIPESTATUS=([0]="0")
      PPID=13
      PS4='+ '
      PWD=/home/test
      SHELL=/bin/bash
      SHELLOPTS=braceexpand:hashall:interactive-comments
      SHLVL=1
      SSH_CLIENT='9.42.27.56 44378 22'
      SSH_CONNECTION='9.42.27.56 44378 172.17.0.2 22'
      TERM=dumb
      UID=1000
      USER=test
      _=sudo
      [09/27/18 02:53:32] [SSH] Checking java version of /home/test/jdk/bin/java
      Couldn't figure out the Java version of /home/test/jdk/bin/java
      bash: /home/test/jdk/bin/java: No such file or directory
      
      [09/27/18 02:53:33] [SSH] Checking java version of java
      [09/27/18 02:53:34] [SSH] java -version returned 1.8.0_151.
      [09/27/18 02:53:34] [SSH] Starting sftp client.
      [09/27/18 02:53:34] [SSH] Copying latest remoting.jar...
      [09/27/18 02:53:36] [SSH] Copied 776,265 bytes.
      Expanded the channel window size to 4MB
      [09/27/18 02:53:36] [SSH] Starting agent process: cd "/home/test" && java  -jar remoting.jar -workDir /home/test
      Sep 27, 2018 6:54:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/test/remoting as a remoting work directory
      Both error and output logs will be printed to /home/test/remoting
      ERROR: null
      java.util.concurrent.CancellationException
      	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:904)
      	at hudson.slaves.DelegatingComputerLauncher.launch(DelegatingComputerLauncher.java:64)
      	at io.jenkins.docker.connector.DockerComputerConnector$1.launch(DockerComputerConnector.java:117)
      	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      [09/27/18 02:57:02] Launch failed - cleaning up connection
      Slave JVM has not reported exit code. Is it still running?
      [09/27/18 02:57:02] [SSH] Connection closed.
      
      
      

        Attachments

          Issue Links

            Activity

            kraxel72 Gerd Hoffmann added a comment -

I've actually changed two values, one from 210 to 600. My UI is in German, so I'm not entirely sure what the English label is, but "Connection Timeout in Seconds" sounds right.

The other one is two lines below the first, switched from 10 to 60. According to the label, this is the interval between connection attempts.

            mzanetti Michael Zanetti added a comment -

Ok... I ran into the same issue. I'd like to clarify the confusion about why increasing "Connection Timeout in Seconds" actually helps here, even though it should only affect connection establishment, not already working connections.

I have noticed that whenever I ran into this issue, there was first a failed connection attempt. Jenkins would then retry, succeed on the second attempt, and everything would seem to work: jobs start building. However, after about 3-4 minutes they fail with the connection breakdown shown above. This does indeed match the 210 seconds, but why?

It seems there must be a bug in the SSH connector code: when a connection attempt fails, Jenkins does retry, but it does not seem to stop the timer still running for the previous, failed attempt. The second attempt might establish successfully, but the timeout timer for the first is still active, and when it runs out it kills the currently working connection, causing builds to fail.

So, increasing the "Connection timeout in seconds" does work around this issue; the actual cause, however, seems to be somewhere in the code that handles SSH connections (see the sketch below).
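For what it's worth, a minimal Java sketch of the suspected pattern, purely hypothetical and not the actual ssh-slaves or docker-plugin code: a timeout task scheduled for the failed first attempt is never cancelled when the retry starts, so when it fires it cancels the future backing the second, healthy connection, and the blocked Future.get() surfaces exactly the CancellationException shown in the log above.

    import java.util.concurrent.*;

    // Hypothetical illustration of the suspected bug: the timeout scheduled for
    // attempt #1 is never cancelled, so it fires while attempt #2 is healthy and
    // cancels its future, surfacing as java.util.concurrent.CancellationException.
    public class StaleTimeoutSketch {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
            ExecutorService pool = Executors.newCachedThreadPool();

            // Attempt #2: a "launch" that would otherwise keep running just fine.
            CompletableFuture<Void> launch = CompletableFuture.runAsync(() -> {
                try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
            }, pool);

            // Stale timeout left over from the failed attempt #1 (210 s in the
            // report, shortened here); nothing cancels it when the retry starts.
            scheduler.schedule(() -> launch.cancel(true), 1, TimeUnit.SECONDS);

            try {
                launch.get();                      // blocks like SSHLauncher.launch()
            } catch (CancellationException e) {
                e.printStackTrace();               // java.util.concurrent.CancellationException
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            } finally {
                scheduler.shutdownNow();
                pool.shutdownNow();
            }
        }
    }

If that theory holds, raising the connection timeout only widens the window in which the stale timer is harmless, which would explain why the workaround is effective without fixing the root cause.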

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

This timeout is only used when establishing a connection. The 3-4 minutes you see is the time it takes the ping thread to check the connection; if it cannot communicate with the agent, it breaks the channel. You have to review the exact exception shown in the Jenkins instance and the exceptions in the agent logs. I'd recommend enabling verbose logging in the SSH server to show the exact moment the disconnection happens, because it usually happens before the ping thread checks the connection; in that case, the problem is in the SSH configuration or in the network infrastructure.
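Purely as an illustration of that explanation, a generic keep-alive loop in the spirit of the ping thread; the interval, timeout, and closeChannel() hook are assumptions here, not the real hudson.remoting implementation.

    import java.util.concurrent.*;

    // Illustrative keep-alive loop: if the agent does not answer a ping within
    // the timeout, the channel is torn down. Names and values are assumptions.
    public class PingLoopSketch {
        public static void monitor(Callable<Boolean> pingAgent, Runnable closeChannel,
                                   long intervalSec, long timeoutSec) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            ExecutorService pinger = Executors.newSingleThreadExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                Future<Boolean> ping = pinger.submit(pingAgent);
                try {
                    if (!ping.get(timeoutSec, TimeUnit.SECONDS)) {
                        closeChannel.run();        // agent answered, but reported not OK
                    }
                } catch (Exception e) {            // timeout or I/O failure
                    ping.cancel(true);
                    closeChannel.run();            // break the channel, as in the log above
                }
            }, intervalSec, intervalSec, TimeUnit.SECONDS);
        }
    }

Comparing the controller and agent logs at the exact moment the channel closes should show whether the disconnect comes from a check like this or from something earlier, as suggested above.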

mzanetti Michael Zanetti added a comment - edited

I understand that this timeout is not supposed to do this. Still, I am pretty sure it does...

Ever since I increased it to an insanely high number, the issue is gone.

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            Michael Zanetti
I've read the whole thing again, and I think it could be related to how the Docker plugin uses the SSHLauncher: I think it reuses the same object to launch all agents, but SSHLauncher is designed to manage one agent, not a fleet. So I want to test something pretty similar to your configuration and conditions to see if I can replicate the issue and find a solution. Could you tell me what your agent configuration looks like, and under which conditions the issue shows up?
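To make the concern concrete, a hypothetical sketch of that reuse pattern; AgentConnection, StatefulLauncher, and the method names are illustrative only, not the real docker-plugin or ssh-slaves-plugin code.

    // Hypothetical sketch of the reuse hazard: a launcher holding per-launch
    // state is shared across agents, so one agent's cleanup or overwrite can
    // affect another agent's connection.
    class AgentConnection {
        private final String host;
        AgentConnection(String host) { this.host = host; }
        void close() { System.out.println("closing connection to " + host); }
    }

    class StatefulLauncher {
        private AgentConnection connection;   // per-launch state; unsafe to share

        void launch(String host) {
            connection = new AgentConnection(host);   // a second launch overwrites the first agent's state
        }

        void cleanup() {
            if (connection != null) connection.close();   // may tear down another agent's connection
        }
    }

    public class LauncherReuseSketch {
        public static void main(String[] args) {
            // Risky: one launcher instance shared by the whole fleet.
            StatefulLauncher shared = new StatefulLauncher();
            shared.launch("agent-1");
            shared.launch("agent-2");   // agent-1's connection reference is now lost
            shared.cleanup();           // only agent-2 gets cleaned up

            // Safer: one launcher per agent, so launch/cleanup state never crosses.
            StatefulLauncher perAgent = new StatefulLauncher();
            perAgent.launch("agent-3");
            perAgent.cleanup();
        }
    }

If the Docker plugin really does share one launcher instance across the fleet, a cleanup or timeout triggered for one container could act on state belonging to another, which would fit the symptoms reported here.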


              People

              • Assignee:
                ifernandezcalvo Ivan Fernandez Calvo
                Reporter:
                durgadas Durgadas Kamath
              • Votes:
                3
                Watchers:
                12

                Dates

                • Created:
                  Updated: