JENKINS-59817

Swarm client hangs indefinitely while waiting for HTTP handshake to complete

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: swarm-plugin
    • Labels:
    • Environment:

      Description

      The Swarm client hangs indefinitely when its HTTP handshake with a Jenkins Master stalls.

      In our situation, the Jenkins Master was responsive via the UI, and there was no issue connecting a conventional Agent via the web launcher. For some reason, however, the connection from a Swarm client would simply hang. The Master didn't log any connection attempt, so I can only provide logs from the client.

      Restarting the Master solved the issue (which is why I'm reporting the bug as "minor" for now). My main concern is that, because the Swarm client is designed for auto-discovery, clients could gradually get stuck on a broken Master and hang indefinitely, leaving the remaining instances in the cluster unattended.

      Some attempts...

      We didn't have any issues connecting the same Swarm clients to other Masters in the same infrastructure, so we ruled out a network issue.

      We tried with Swarm client v3.17 and 3.14, to no avail.

      The Swarm client failed to connect from both Windows and Linux (CentOS) nodes.

      About the logs...

      Sadly, I had to redact the company name, machine names and similar details. Sorry about that.

      To collect the logs, I disabled SSL verification on the node.

      swarm-healthy-log.txt is the full log of the swarm client connecting to a Master from our infra without issues (for reference).

      swarm-issue-log1.txt and swarm-issue-log2.txt are the full logs connecting to the troublesome Master. Notice that the handshake failed at different points. Sometimes the first handshake would succeed, but we never succeeded in the second one.

       

      Expectation...

      I understand that the Master could have been in a corrupt state somehow. As said, restarting it brought things back to normal.

      However, we should expect the Swarm client to be resilient against problems with a Master, precisely because of the auto-discovery feature. This client has more autonomy, so if it can't connect to a Master, it should simply move on to the next one.
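      To illustrate the "move on to the next one" behaviour, here is a minimal sketch. It is not the Swarm client's actual code: the `FailoverSketch` class, the `Handshake` interface and the example URLs are all hypothetical, and it assumes each handshake attempt is bounded by a timeout so that a stalled Master surfaces as an `IOException` instead of blocking forever.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FailoverSketch {

    /** Hypothetical stand-in for the handshake the client performs against one Master. */
    public interface Handshake {
        void connect(String masterUrl) throws IOException;
    }

    /** Tries each candidate in order and returns the first one whose handshake completes. */
    public static Optional<String> firstReachable(List<String> candidates, Handshake handshake) {
        for (String candidate : candidates) {
            try {
                handshake.connect(candidate);
                return Optional.of(candidate);
            } catch (IOException e) {
                // A timeout arrives here as an IOException (e.g. SocketTimeoutException),
                // so a stalled Master is skipped instead of blocking the loop.
                System.err.println("Skipping " + candidate + ": " + e.getMessage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical candidates; in practice these would come from auto-discovery.
        List<String> candidates = Arrays.asList("http://master-a.example", "http://master-b.example");

        // Fake handshake: master-a behaves like the broken Master and times out.
        Handshake fake = url -> {
            if (url.contains("master-a")) {
                throw new SocketTimeoutException("Read timed out");
            }
        };

        System.out.println(firstReachable(candidates, fake));
    }
}
```

      The point of the design is that a failure against one Master is treated as a reason to try the next candidate, not as a fatal or blocking condition.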

      The suggested fix (maybe I'm being naive on this) would be a timeout for all the HTTP requests.

      For example here and here:

      https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L478

      https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L392
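      As a rough sketch of what I mean by a timeout, the example below uses the JDK's `HttpURLConnection` rather than whatever HTTP library the Swarm client actually uses at those lines, so treat it as an illustration of the idea only. The `TimeoutSketch` class, the `fetchStatus` helper and the one-second values are made up for the example. The "Master" is simulated by a local socket that accepts TCP connections but never answers, which is roughly what we observed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

public class TimeoutSketch {

    /**
     * Returns the HTTP status code. Throws SocketTimeoutException if connecting
     * or waiting for the response takes longer than the given limits.
     */
    public static int fetchStatus(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the TCP connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each blocking read of the response
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated stalled Master: the listening socket lets the OS complete the
        // TCP handshake, but nothing ever reads the request or writes a response.
        try (ServerSocket stalled = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + stalled.getLocalPort() + "/").toURL();
            try {
                fetchStatus(url, 1000, 1000);
                System.out.println("Got a response (not expected in this simulation)");
            } catch (SocketTimeoutException e) {
                System.out.println("Request timed out instead of hanging: " + e.getMessage());
            }
        }
    }
}
```

      Without the two `set...Timeout` calls the default timeout is 0, which means wait forever, and that matches the hang described above.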

       

       

        Attachments

          Activity

          Basil Crow (basil) added a comment:

          I like the suggestion for an HTTP timeout. If this problem occurs again, could you get a thread dump from the Jenkins master and the Swarm client agent at the time of the hang? I would like to see the stack trace on each side of the connection. This should help pinpoint the cause of the hang.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Rafael Rezende (rafaelrezend)
            • Votes:
              0
            • Watchers:
              2
