JENKINS-48490

Intermittently slow docker provisioning with no errors


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Component/s: docker-plugin
    • Labels:
      None
    • Environment:
      Jenkins 2.93
      Docker Plugin 1.1.1
      Containers are using JNLP

      Description

      I have a large Docker swarm (old-style Docker Swarm API in a container). There is plenty of capacity (multiple TB of RAM, etc.).

      When jobs (a multibranch Pipeline job in this case) allocate a Docker node (by label), one of these things happens:

      1. Node is allocated immediately
      2. Node is not allocated and the Jenkins logs indicate why (e.g. the swarm is full according to the maximums set in my Jenkins configuration)
      3. Node is allocated with a significant delay (minutes).  Logs do not indicate why; there is no Docker plugin log activity until the node is allocated.
      4. Node is allocated with a ridiculous delay (I just had one take 77 minutes).  Logs do not indicate any activity from the Docker plugin until it is allocated.  Other jobs have gotten containers allocated since (and those events are in the logs).  An interesting thing I noticed is that the job sometimes gets its container only once a later build of this job requests one (they run in parallel), and then the later build waits (forever?).

      How can I troubleshoot this behavior, especially #4?
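
      For reference, one way to gather more evidence during these silent gaps is to raise the docker-plugin log level while reproducing the delay. A minimal script-console sketch, assuming the plugin's classes still live under the com.nirima.jenkins.plugins.docker and io.jenkins.docker packages (adjust for your plugin version):

      {code:groovy}
      import java.util.logging.Level
      import java.util.logging.Logger

      // Assumed package names for the docker-plugin; adjust if your version differs.
      ['com.nirima.jenkins.plugins.docker', 'io.jenkins.docker'].each { pkg ->
          Logger.getLogger(pkg).setLevel(Level.FINE)
          println "Log level for ${pkg} set to FINE"
      }
      {code}

      A log recorder (Manage Jenkins > System Log) targeting the same packages is also needed so the FINE messages are actually captured somewhere readable.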

      Because it is intermittent, I can't be sure, but it seems as if it has gotten worse after the Docker Plugin 1.0.x to 1.1.x upgrade (possibly also the Jenkins 2.92 to 2.93 upgrade).

      In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1, and the one running 1.1 is currently not exhibiting these issues (but it is also under less load).

        Attachments

          Activity

          jswager1 Jason Swager added a comment -

          We are seeing similar symptoms after upgrading the Docker plugin. Our Jenkins masters started seeing this problem after upgrading to Docker Plugin v1.1.1 and even v1.1.2. One big difference: we're using SSH to connect rather than JNLP.

          The larger and busier the Jenkins master, the faster this problem occurs. On our larger one we had to downgrade the Docker plugin to its prior version, 0.16.2. The smaller Jenkins masters don't suffer from the problem immediately, and a restart clears it - at least for another couple of days.

          seakip18 Matthew Ludlum added a comment -

          I've been trying to reproduce this issue on my local box using SSH executors with both 1.1.2 and 1.1.3 to no avail. We are still seeing it on other instances.

          In the short term, I've thrown together a quick-and-dirty script job to "unstick" the jobs.

          https://gist.github.com/MattLud/1f8a56fcce933f7e97c366de54c85ba9
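
          The gist itself isn't reproduced here; as a rough, hypothetical illustration only, a script-console job along these lines will at least report which queue items have been stuck for a long time and why (the 30-minute threshold is arbitrary; actually cancelling or re-provisioning is left to the gist):

          {code:groovy}
          import java.util.concurrent.TimeUnit
          import jenkins.model.Jenkins

          // Report queue items that have been waiting suspiciously long, with Jenkins' own explanation.
          Jenkins.instance.queue.items.each { item ->
              long waitedMin = TimeUnit.MILLISECONDS.toMinutes(System.currentTimeMillis() - item.inQueueSince)
              if (waitedMin > 30) {   // arbitrary threshold for "stuck"
                  println "${item.task.fullDisplayName} has been waiting ${waitedMin} min: ${item.why}"
              }
          }
          {code}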

           

           

          pjdarton pjdarton added a comment -

          We used to encounter these kinds of issues. We eventually traced it to cloud plugins doing blocking operations within the main "provisioning" thread; node-disposal operations (which happen while a Jenkins core object is locked) were also sometimes taking a long time. Together these severely impeded Jenkins' ability to create and destroy slaves. This only happened when the servers the plugins were communicating with weren't responding swiftly, but as Docker is prone to total lockups (and vSphere can take 4 hours to fail a 40-millisecond operation), when all is not well, Jenkins also becomes unwell.

          It took a fair bit of work, but I made enhancements to both the vSphere plugin and the Docker plugin to reduce the number of remote API calls made during the "provisioning" and "termination" process threads, and to ensure that everything had a timeout (so nothing would lock up forever). The vSphere plugin (version 2.16 onwards) contains my changes, but you'd have to get the bleeding-edge build of the docker-plugin from here for my changes to that (as we've not done a release yet).
          Note: If you take the docker plugin, make sure that you set a non-zero connection timeout, read timeout and (in the templates) pull timeout. Also, for pure speed, remove the instance caps (if you don't specify an instance cap, the plugin no longer counts the instances). That should ensure that nothing can cause the core Jenkins cloud resources to stay "locked" for a long period of time.
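
          As a generic sketch of the technique described above (not the plugin's actual code), the idea is to run each remote call under a hard deadline so a hung Docker daemon cannot pin the provisioning thread forever; the slowRemoteDockerCall closure below is a hypothetical stand-in:

          {code:groovy}
          import java.util.concurrent.Callable
          import java.util.concurrent.Executors
          import java.util.concurrent.TimeUnit
          import java.util.concurrent.TimeoutException

          // Hypothetical stand-in for a docker remote-API call that may hang indefinitely.
          def slowRemoteDockerCall = { -> Thread.sleep(5000); return 'container-id' }

          def executor = Executors.newSingleThreadExecutor()
          def future = executor.submit(slowRemoteDockerCall as Callable)
          try {
              def result = future.get(30, TimeUnit.SECONDS)   // analogue of the plugin's read timeout
              println "Remote call returned: ${result}"
          } catch (TimeoutException e) {
              future.cancel(true)   // interrupt the hung call instead of waiting forever
              println "Remote call timed out; give up and treat this cloud as temporarily unusable"
          } finally {
              executor.shutdownNow()
          }
          {code}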

          See also: JENKINS-49235 as that has the potential to cause problems for a busy Jenkins server.

          pjdarton pjdarton added a comment -

          I believe that, through careful use of "read timeout" and "pull timeout" on docker clouds & templates, coupled with the "avoid using broken clouds/templates" feature (all introduced in 1.1.4), this issue should now be fixed.

          We have a fairly busy Jenkins server with dozens of static nodes, lots of docker clouds, hundreds of jobs, and many dozens of builds running at any one time and, since adding this functionality, it all seems stable now (other than JENKINS-53621 which is a separate issue).

           

          TL;DR: I think it's fixed; re-open if it's still there in 1.1.6.
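
          Before re-opening, it's worth confirming which plugin version the affected instance is actually running; a quick script-console check (assuming the plugin's short name is still docker-plugin):

          {code:groovy}
          import jenkins.model.Jenkins

          // Print the installed docker-plugin version (plugin short name assumed to be 'docker-plugin').
          def plugin = Jenkins.instance.pluginManager.getPlugin('docker-plugin')
          println(plugin ? "${plugin.shortName} ${plugin.version}" : 'docker-plugin is not installed')
          {code}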

          seaspu Dennis Keitzel added a comment -

          For reference: We hit a similar issue where the cloud-stats-plugin was the cause. See JENKINS-56863.


            People

            • Assignee:
              pjdarton pjdarton
            • Reporter:
              akom Alexander Komarov
            • Votes:
              2
            • Watchers:
              11
