Jenkins / JENKINS-59790

Container cannot connect to node because it doesn't exist

      Description

      We recently updated Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:

      Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
      

      The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

      I saw in the docker-plugin code that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed as soon as the container starts. But at that moment, the node has not yet been added to the Jenkins master.

      Have you ever encountered this error? Is there a solution?

      Is it possible to modify the provision methods to create the Jenkins node before instantiating the container, to fix this issue?

      Jenkins version : 2.176.3

      docker-plugin version : 1.1.7

      docker host version : 1.13.1

          Activity

          matttt Mathieu Delrocq added a comment -

          We are currently testing "Attach Docker container", which seems to be a solution. But in the plugin documentation, this functionality is marked as experimental. Is this still the case?

          matttt Mathieu Delrocq added a comment -

          You can follow this issue on GitHub: Issue #757

           

          matttt Mathieu Delrocq added a comment - - edited

          I am closing this issue; it is tracked in jenkinsci/docker-plugin#757

           

          akom Alexander Komarov added a comment - - edited

          For those of us that do want to use JNLP rather than Attach mode, here is a quick and dirty workaround that has proven to stabilize docker launching under heavy load (in my case, a matrix job spawning 70 containers simultaneously).  All it does is re-run the JNLP script a few times in case the master wasn't ready.  Without this, I was left with a ton of stopped containers and no resources.

          I simply change the ENTRYPOINT in my images from the default (/usr/local/bin/jenkins-agent script from jenkins/jnlp-slave) to this script:

           

          #!/bin/bash
          
          ACTUAL_ENTRYPOINT=/usr/local/bin/jenkins-slave
          # sleep between retries, if needed (s)
          SLEEP=${JNLP_RETRY_SLEEP:-5}
          # Try to reconnect this many times
          TRIES=${JNLP_RETRY_COUNT:-3}
          # Stop retrying after this many seconds regardless
          MAXTIME=${JNLP_RETRY_MAXTIME:-60}
          
          # Do not retry if we're running bash to debug in this container
          # more than 1 arg is probably jenkins jnlp start
          if [ $# -eq 1 ] ; then
            exec "$@"
          fi
          
          START=$(date +%s)
          # Default exit code if the loop never runs (e.g. JNLP_RETRY_COUNT=0)
          CODE=1
          while [ "$TRIES" -gt 0 ] && [ $(($(date +%s) - START)) -lt "$MAXTIME" ] ; do
            # Run the real entrypoint in the loop body and record its exit status
            # (reading $? after "! cmd" in the loop condition always yields 0)
            "$ACTUAL_ENTRYPOINT" "$@"
            CODE=$?
            # Stop retrying once the entrypoint exits cleanly
            [ "$CODE" -eq 0 ] && break
            echo "$ACTUAL_ENTRYPOINT exited [$CODE], waiting $SLEEP seconds and retrying"
            sleep "$SLEEP"
            TRIES=$((TRIES - 1))
          done
          
          echo "Exiting [$CODE] with $TRIES remaining tries and $(($(date +%s) - START)) seconds elapsed"
          
          exit "$CODE"
          

          and my Dockerfile looks like this:

          # based on jenkins/jnlp-slave, directly or indirectly
          FROM jenkins/jnlp-slave
          CMD [ "/bin/bash" ]
          ENTRYPOINT [ "entrypoint" ]
          

          (My images have bash... for alpine you may want to change the shebang to /bin/sh)

           

          For the record, network stability has been an issue for me with Attach.  Since I'm using classic Swarm, the topology is too complex and the connection is sometimes lost:

          Master -> Swarm Manager -> Docker Host -> Container

          With JNLP, it's simply:

          Container -> Master

           

          Oleg Nenashev, perhaps it may be worthwhile to integrate some (better than the above) retry logic in jenkins-agent script or even slave.jar itself?  (possibly after identifying the "unknown name" error from the master).   Happy to make a PR with some guidance.

          pjdarton pjdarton added a comment -

          Personally, I approve of adding retry logic to pretty-much anything network related. The SSH connection mechanism from Master -> Slave-Node has lots of (configurable) retry logic so there is "prior art" to having this.

          There are a lot of chicken/egg issues when it comes to starting Jenkins slaves, so it makes a lot of sense to ensure that no aspect of this delicate negotiation process requires things to happen in a specific order.
          e.g. in the docker-plugin's case, it wants to know the container-ID of the container (which is only available once the "run container" command has returned) to write into the slave node instance before returning it to Jenkins, so it has to start the container before Jenkins gets it ... but if the slave container starts up very quickly then Jenkins might well receive and reject its connection request before Jenkins adds the node to its list of permitted slaves.
          A retry mechanism would allow an easy workaround for this ... as well as helping with situations where the network between the master and slave is less than perfect.
          FYI the script I use for starting Windows VMs with the vSphere-plugin (linked to from the vSphere plugin's wiki page) that connect via JNLP contains a lot of retry logic and that's proved its worth many times over.

          What I would recommend, however, is that the number of retries and the delay between retries be made configurable.
          ...and I'd also recommend that

          $ACTUAL_ENTRYPOINT $*
          

          should be changed to

          ${ACTUAL_ENTRYPOINT} "$@"
          

          so whitespace in arguments gets preserved (something that'd be less of an issue if this retry logic was incorporated in the core /usr/local/bin/jenkins-slave script).
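
The configurable retry recommended in this comment could be factored into a small helper. This is only a sketch: the function name `retry` and its argument order are hypothetical and are not part of the jenkins-slave script or any Jenkins image. It takes the number of attempts and the delay as its first two arguments, shifts them away, and forwards everything else with "$@" so whitespace in arguments survives.

```shell
#!/bin/sh
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of any Jenkins image)
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
# Returns the exit status of the last attempt.
retry() {
  max_tries=$1
  delay=$2
  shift 2            # what remains in "$@" is the command and its arguments
  attempt=1
  while :; do
    "$@"             # quoted so each argument is passed through unchanged
    status=$?
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_tries" ]; then
      return "$status"
    fi
    echo "attempt $attempt failed [$status], retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

An entrypoint script could then call something like `retry "${JNLP_RETRY_COUNT:-3}" "${JNLP_RETRY_SLEEP:-5}" /usr/local/bin/jenkins-slave "$@"`, reusing the environment variable names from the workaround above.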

          akom Alexander Komarov added a comment -

          Completely agree, pjdarton.

          The above was meant to be an example of a quick-and-easy fix for my use, not a polished product.   Once we get into command-line args territory there is an increase in complexity (like shifting bash args).  Currently (with the script behavior hardcoded) I can simply substitute my images in both k8s and docker jenkins plugins, without manually configuring entrypoint command-line args in the UI (using implicit defaults).  

          So basically we agree that this logic would ideally be part of the jnlp image components.

          Fair point about spaces, I'll edit my code above.

          pjdarton pjdarton added a comment -

          Note: $@ not $*
          FYI

          "$*"

          will join all CLI arguments into one argument, which is pretty much guaranteed to break things (it'll break things if more than one argument was provided), whereas

          $*

          would only break things if folks provided arguments containing whitespace.

          "$@"

          is the best option when you want to "pass through all arguments as they were provided".

          TL;DR: Whitespace in arguments is very easy to get wrong
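
The difference between the three forms can be checked with a few lines of shell. This is a minimal sketch; the helper names `count_args` and `forward_*` are made up for the demonstration, and the default `IFS` is assumed.

```shell
#!/bin/sh
# count_args prints how many arguments it received.
count_args() {
  echo "$#"
}

# Each forward_* function re-passes its own arguments in a different way.
forward_unquoted_star() { count_args $*; }     # re-split on whitespace
forward_quoted_star()   { count_args "$*"; }   # joined into one argument
forward_quoted_at()     { count_args "$@"; }   # passed through unchanged

# Two arguments, the first containing a space.
set -- "hello world" "-flag"

echo "unquoted \$*: $(forward_unquoted_star "$@") arguments"
echo "quoted \"\$*\": $(forward_quoted_star "$@") arguments"
echo "quoted \"\$@\": $(forward_quoted_at "$@") arguments"
```

With the two arguments above, the unquoted form reports 3, "$*" reports 1, and only "$@" reports the original 2.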

          akom Alexander Komarov added a comment - - edited

          Thanks pjdarton for the reminder.  I also added rudimentary configuration for sleep/etc via environment variables.


            People

            • Assignee:
              ndeloof Nicolas De Loof
              Reporter:
              matttt Mathieu Delrocq