Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-37719

Build cannot be interrupted if `docker stop` hangs

    XMLWordPrintable

    Details

    • Similar Issues:

      Description

      Cyrille Le Clerc found a problem in a build doing basically

      docker.image('...').inside {
        sh '...'
      }
      

      The shell command appeared to complete, but then the build seemed to hang, and did not respond to anything short of a hard kill.

      A virtual thread dump just showed WorkflowScript on the last line of the closure. A physical thread dump of the master showed the problem:

      at ...
      at hudson.Launcher$ProcStarter.join(Launcher.java:388)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:260)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:241)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:238)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:133)
      at org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:104)
      at ...
      

      On the agent, stream copiers were still active for the docker exec call associated with the shell step, as well as a docker stop call on the container.

      Indeed process inspection showed that both docker commands were running, and docker ps just hanged: for reasons TBD, the Docker daemon was not responding to requests (though docker version worked). Docker did not work again until sudo service docker restart was run.

      Was not clear if killing the docker stop process would have resumed the build, since by that point a sequence of kill escalations had already been attempted.

      The most straightforward fix in the plugin would be to use .start().joinWithTimeout(...) rather than .join() in DockerClient. Thus, a failure to clean up a container due to an outage in the Docker daemon would at worst result in a few minutes' delay followed by a (more or less) comprehensible error.

      (All Docker commands in the plugin that are expected to take a long time, because they might contact a registry, are run from sh steps in Groovy code so they are durable and cleanly interruptible. Docker commands which we do not need to wait for, like docker exec to run processes in the container, are just launched and the process handle discarded. DockerClient with its blocking join call is only used for commands which under normal conditions should be close to instantaneous: docker run -d after checking for local availability of the image, docker inspect, docker stop, etc. But clearly this assumption is not completely reliable.)

        Attachments

          Issue Links

            Activity

            Hide
            jglick Jesse Glick added a comment -

            Implementing the recommendations in JENKINS-32986 would probably also have solved this: the CPS VM was blocked in an interruptible call, so Thread.interrupt ought to have caused Callback.finished to throw InterruptedException or some wrapper thereof, which should have caused the withDockerContainer step to fail cleanly.

            Show
            jglick Jesse Glick added a comment - Implementing the recommendations in JENKINS-32986 would probably also have solved this: the CPS VM was blocked in an interruptible call, so Thread.interrupt ought to have caused Callback.finished to throw InterruptedException or some wrapper thereof, which should have caused the withDockerContainer step to fail cleanly.
            Hide
            jglick Jesse Glick added a comment -

            Related problem in JENKINS-37720 getting diagnostics.

            Show
            jglick Jesse Glick added a comment - Related problem in JENKINS-37720 getting diagnostics.
            Hide
            jglick Jesse Glick added a comment -

            As a side note, after a hard kill, the entire build directory abruptly disappears except for build.xml, so the log disappears. Unclear why; perhaps LogRotator is trying to delete the build after it exits, but then something asynchronously saves build.xml?

            Show
            jglick Jesse Glick added a comment - As a side note, after a hard kill, the entire build directory abruptly disappears except for build.xml , so the log disappears. Unclear why; perhaps LogRotator is trying to delete the build after it exits, but then something asynchronously saves build.xml ?
            Show
            cleclerc Cyrille Le Clerc added a comment - We regularly reproduce see https://gist.github.com/cyrille-leclerc/20e36ba0926b429b6ed1e485f0339334
            Hide
            rtyler R. Tyler Croy added a comment -

            Adding a thread dump from a Pipeline I have hung at the moment:

            Thread #12
            	at WorkflowScript.run(WorkflowScript:64)
            	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
            	at DSL.withDockerContainer(Native Method)
            	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:122)
            	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:63)
            	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:116)
            	at WorkflowScript.run(WorkflowScript:60)
            	at DSL.timeout(killer task reported done)
            	at WorkflowScript.run(WorkflowScript:56)
            	at DSL.stage(Native Method)
            	at WorkflowScript.run(WorkflowScript:52)
            	at DSL.node(running on trusted-agent-1)
            	at WorkflowScript.run(WorkflowScript:21)
            

            A resulting effect of this is that the timeout block will not properly work either leading to never-ending-Pipelines.

            Show
            rtyler R. Tyler Croy added a comment - Adding a thread dump from a Pipeline I have hung at the moment: Thread #12 at WorkflowScript.run(WorkflowScript:64) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123) at DSL.withDockerContainer(Native Method) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:122) at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:63) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:116) at WorkflowScript.run(WorkflowScript:60) at DSL.timeout(killer task reported done) at WorkflowScript.run(WorkflowScript:56) at DSL.stage(Native Method) at WorkflowScript.run(WorkflowScript:52) at DSL.node(running on trusted-agent-1) at WorkflowScript.run(WorkflowScript:21) A resulting effect of this is that the timeout block will not properly work either leading to never-ending-Pipelines.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            pom.xml
            src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java
            src/main/java/org/jenkinsci/plugins/docker/workflow/client/DockerClient.java
            src/test/java/org/jenkinsci/plugins/docker/workflow/DockerTestUtil.java
            src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java
            http://jenkins-ci.org/commit/docker-workflow-plugin/3f8a83fbacdce1e3d286cf1790182a89d9f6cd11
            Log:
            [FIXED JENKINS-37719] Apply a timeout to all Docker CLI operations not already run from durable steps.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: pom.xml src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java src/main/java/org/jenkinsci/plugins/docker/workflow/client/DockerClient.java src/test/java/org/jenkinsci/plugins/docker/workflow/DockerTestUtil.java src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java http://jenkins-ci.org/commit/docker-workflow-plugin/3f8a83fbacdce1e3d286cf1790182a89d9f6cd11 Log: [FIXED JENKINS-37719] Apply a timeout to all Docker CLI operations not already run from durable steps.

              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                jglick Jesse Glick
              • Votes:
                1 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: