  1. Jenkins
  2. JENKINS-59790

Container cannot connect to node because it doesn't exist

    Details

      Description

      We recently updated Jenkins to 2.176.3, and now a connection error with the docker agent randomly blocks the job queue:

      Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
      

      The Docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

      I saw in the docker-plugin code that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

      Have you ever encountered this error? Is there a solution?

      Is it possible to modify the provisioning methods to create the Jenkins node before instantiating the container, to fix this issue?

      Jenkins version : 2.176.3

      docker-plugin version : 1.1.7

      docker host version : 1.13.1

            Activity

            akom Alexander Komarov added a comment -

            Completely agree, pjdarton.

            The above was meant to be an example of a quick-and-easy fix for my use, not a polished product.   Once we get into command-line args territory there is an increase in complexity (like shifting bash args).  Currently (with the script behavior hardcoded) I can simply substitute my images in both k8s and docker jenkins plugins, without manually configuring entrypoint command-line args in the UI (using implicit defaults).  

            So basically we agree that this logic would ideally be part of the jnlp image components.

            Fair point about spaces, I'll edit my code above.

            pjdarton pjdarton added a comment -

            Note: $@ not $*
            FYI

            "$*"

            will join all CLI arguments into a single argument, which is pretty much guaranteed to break things (it will break whenever more than one argument is provided), whereas

            $*

            would only break things if folks provided arguments containing whitespace (or glob characters).

            "$@"

            is the best option when you want to "pass through all arguments as they were provided".

            TL;DR: Whitespace in arguments is very easy to get wrong
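To make the difference concrete, here is a minimal sketch. The helper `count_args` and the sample arguments (including the `example.invalid` URL) are made up for illustration and are not part of any script in this thread. It counts how many arguments a command receives under each form of expansion, assuming the default IFS.

```shell
#!/usr/bin/env sh
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
    echo "$#"
}

# Simulate a command line with three arguments, one of which contains a space.
set -- -url "http://example.invalid/jenkins" "my agent"

n_quoted_star=$(count_args "$*")   # all arguments joined into one word
n_bare_star=$(count_args $*)       # re-split on whitespace: "my agent" becomes two words
n_quoted_at=$(count_args "$@")     # each original argument passed through intact

echo "quoted \$* passes $n_quoted_star argument(s)"
echo "bare \$* passes $n_bare_star argument(s)"
echo "quoted \$@ passes $n_quoted_at argument(s)"
```

Only the quoted `$@` form hands the callee the same three arguments that were supplied.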

            akom Alexander Komarov added a comment - - edited

            Thanks pjdarton for the reminder.  I also added rudimentary configuration for sleep/etc via environment variables.

            gregory_picot Gregory PICOT added a comment - - edited

            Hi,

            Thanks, Alexander Komarov, for the retry bit. We tried it on our end to make the JNLP connection more robust. The retry itself seems to work great, but since we implemented it, our containers don't terminate properly:

            The kill -15 sent on job termination is not handled correctly by the container, and we have to wait 10 seconds for a kill -9 to actually stop the container.

            This is problematic because there is a small window in which the master believes the container (and its agent) is free to use. It could then try to start a job in it, which has no chance of lasting long.

            What I could figure out is that, since the entrypoint is the new script, when $ACTUAL_ENTRYPOINT runs it is not PID 1, and the jar is not linked to our new entrypoint (since it is started by an exec command).

            When the kill -15 occurs, not every process is killed, so the container stays alive.

            I tried adding the exec command before running $ACTUAL_ENTRYPOINT. That resolves the SIGTERM handling issue, but we lose the retry logic...

            I'm still trying to figure out a way to forward SIGTERM properly and keep the retry.
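The symptom described above (a SIGTERM sent to the wrapper never reaching the process it started) can be reproduced outside a container with a small sketch. Everything here is an illustrative assumption: `sleep 30` stands in for the agent JVM, and the wrapper is a plain `sh -c` with no `exec` and no `trap`. Inside a container the wrapper would be PID 1, where an unhandled SIGTERM is ignored rather than fatal, but in both cases the child never sees the signal.

```shell
#!/usr/bin/env sh
# Hypothetical demo: a wrapper shell starts a long-running child without exec or trap.
pidfile=$(mktemp)

# The wrapper records the child's PID in the file, then waits for the child.
sh -c 'sleep 30 & echo $! > "$1"; wait' wrapper "$pidfile" &
wrapper_pid=$!

sleep 1                               # give the wrapper time to start the child
child_pid=$(cat "$pidfile")

kill -TERM "$wrapper_pid"             # the signal sent when the container is stopped
sleep 1

# kill -0 only checks whether the process still exists.
if kill -0 "$child_pid" 2>/dev/null; then
    child_survived=yes                # the signal was not forwarded to the child
else
    child_survived=no
fi
echo "child survived SIGTERM sent to wrapper: $child_survived"

kill -KILL "$child_pid" 2>/dev/null   # clean up the orphaned child
rm -f "$pidfile"
```

Either replacing the wrapper with the child via `exec` or installing a `trap` that forwards the signal avoids the orphan; the first gives up the ability to retry, which is the trade-off described in this comment.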

            gregory_picot Gregory PICOT added a comment -

            Hello,

             

            Here's an update on getting SIGTERM propagation and the retry logic to work together:

            We managed to make them work together by following this article: https://unix.stackexchange.com/questions/146756/forward-sigterm-to-child-in-bash

            Instead of using the exec command, we used the wait and trap commands to meet our needs. I found the article by Andreas Veithen++ very interesting and detailed.

            We restored the "usual" Jenkins entrypoint in the Dockerfile, but instead of starting the jar, it launches our script:

            exec $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@"
            

            is replaced by:

             

            exec /usr/local/bin/jenkins-agent-retry.sh "$@"
            

             

             

            Here is what jenkins-agent-retry.sh looks like now:

             

            #!/usr/bin/env sh
            
            if [ $# -eq 1 ]; then
                # If `docker run` was given only one argument, assume the user is running an alternate command such as `bash` to inspect the image
                exec "$@"
            fi
            
            
            # SIGTERM handling, from https://unix.stackexchange.com/questions/146756/forward-sigterm-to-child-in-bash
            prep_term()
            {
                unset term_child_pid
                unset term_kill_needed
                trap 'handle_term' TERM INT
            }
            
            handle_term()
            {
                if [ "${term_child_pid}" ]; then
                    kill -TERM "${term_child_pid}" 2>/dev/null
                else
                    term_kill_needed="yes"
                fi
            }
            
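            wait_term()
            {
                term_child_pid=$!
                # If the signal arrived before the child PID was known, deliver it now
                if [ "${term_kill_needed}" ]; then
                    kill -TERM "${term_child_pid}" 2>/dev/null
                fi
                # This wait returns early if a trapped signal arrives
                wait "${term_child_pid}"
                trap - TERM INT
                # Wait again to collect the child's actual exit status
                wait "${term_child_pid}"
            }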
            
            echo "[INFO] JDK $($JAVA_BIN -version 2>&1|awk '$0~/openjdk version/ {print $3}') to connect to master"
            echo "[INFO] Remoting Version: $($JAVA_BIN -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless -version)"
            echo "[INFO] Start Agent command: " $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@"
            
            # Retry handling, from https://issues.jenkins-ci.org/browse/JENKINS-59790?focusedCommentId=379913&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-379913
            # Sleep this many seconds between attempts
            SLEEP=${JNLP_RETRY_SLEEP:-5}
            # Try to reconnect this many times
            TRIES=${JNLP_RETRY_COUNT:-3}
            # Stop retrying after this many seconds regardless
            MAXTIME=${JNLP_RETRY_MAXTIME:-60}
            
            START=$(date +%s)
            while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ]; do
                prep_term
                $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@" &
                wait_term
                CODE=$?
                # 143 = 128 + 15 (SIGTERM): the agent was asked to stop, so do not retry
                if [ $CODE -eq 143 ]; then
                    break
                fi
                echo "exited [$CODE], waiting $SLEEP seconds and retrying"
                sleep $SLEEP
                TRIES=$(($TRIES - 1))
            done
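To see how the two bounds (attempt count and total elapsed time) interact without starting a JVM, the loop can be dry-run with a stand-in command. This is only a sketch under assumptions: `false` replaces the agent launch and always exits 1, the `attempts` counter is added for illustration, and the default sleep is lowered to 1 second so the run finishes quickly. The `JNLP_RETRY_*` variable names are the ones used in the script above.

```shell
#!/usr/bin/env sh
# Dry run of the retry bounds; "false" stands in for the agent launch.
SLEEP=${JNLP_RETRY_SLEEP:-1}        # lowered default for the demo (the script above uses 5)
TRIES=${JNLP_RETRY_COUNT:-3}
MAXTIME=${JNLP_RETRY_MAXTIME:-60}

attempts=0                          # illustrative counter, not in the original script
START=$(date +%s)
while [ "$TRIES" -gt 0 ] && [ $(($(date +%s) - $START)) -lt "$MAXTIME" ]; do
    attempts=$((attempts + 1))
    false                           # always fails with exit status 1
    CODE=$?
    echo "exited [$CODE], waiting $SLEEP seconds and retrying"
    sleep "$SLEEP"
    TRIES=$((TRIES - 1))
done
echo "gave up after $attempts attempts"
```

With the defaults, the attempt count is the limit that ends the loop; setting `JNLP_RETRY_MAXTIME` lower than `JNLP_RETRY_COUNT` times `JNLP_RETRY_SLEEP` makes the time limit end it first.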
            

             


              People

              • Assignee:
                ndeloof Nicolas De Loof
                Reporter:
                matttt Mathieu Delrocq
              • Votes:
                0
                Watchers:
                5
