[JENKINS-57831] Swarm client failed to reconnect after Jenkins controller restart - Jenkins Jira

Type: Bug
Resolution: Fixed
Priority: Major
Component/s: remoting, swarm-plugin
Labels:
None
Environment:
Jenkins 2.150.1
Swarm client 3.17 (Remoting 3.30)

Released As:
3.26

Hi, I'm the new maintainer of the Swarm Plugin. I encountered an issue tonight after doing a routine restart of a Jenkins master (to perform a plugin update) that resulted in all my Swarm clients losing their connection to that master (but not my other masters). I explain the details below. I'd welcome your thoughts on my root cause analysis, and I'd be happy to collaborate on a solution with you.

Problem

Typically, my Swarm clients reconnect just fine after a master restarts due to my use of the Swarm client -deleteExistingClients feature. In fact, I even have a unit test for this functionality. And tonight, Swarm clients successfully reconnected when all of my Jenkins masters were restarted, except for one. On that single master (but not the others), all the Swarm clients failed to reconnect. The Swarm client logs on all the failed clients showed messages like the following:

2019-06-04 03:08:24 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://example.com/ to configure swarm client.
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
2019-06-04 03:08:24 FINE hudson.plugins.swarm.Candidate <init> Candidate constructed with url: http://example.com/, secret: <redacted>
2019-06-04 03:08:24 INFO hudson.plugins.swarm.Client run Attempting to connect to http://example.com/ <redacted> with ID 
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createSwarmSlave createSwarmSlave() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
2019-06-04 03:08:24 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient connect connect() invoked
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4973813a6 pairs: {GET /computer/vm.example.com/slave-agent.jnlp HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@6321e8137 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: application/x-java-jnlp-file}{Content-Length: 772}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{Server: Jetty(9.4.z-SNAPSHOT)}
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: vm.example.com
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode.
2019-06-04 03:08:24 INFO hudson.remoting.Engine startEngine Using Remoting version: 3.30
2019-06-04 03:08:24 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars
2019-06-04 03:08:24 FINE hudson.remoting.Engine startEngine Using standard File System JAR Cache. Root Directory is /var/tmp/jenkins/.jenkins/cache/jars
2019-06-04 03:08:24 FINE org.jenkinsci.remoting.protocol.IOHub create Staring an additional Selector wakeup thread. See JENKINS-47965 for more info
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://example.com/]
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4ae378ec6 pairs: {GET /tcpSlaveAgentListener/ HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@fe1efd411 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: text/plain;charset=utf-8}{Content-Length: 12}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{X-Hudson-JNLP-Port: 55000}{X-Jenkins-JNLP-Port: 55000}{X-Instance-Identity: <redacted>}{X-Jenkins-Agent-Protocols: JNLP4-connect, Ping}{Server: Jetty(9.4.z-SNAPSHOT)}
2019-06-04 03:08:24 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping]
2019-06-04 03:08:24 WARNING org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible Connection refused (Connection refused)
2019-06-04 03:08:24 SEVERE hudson.remoting.jnlp.Main$CuiListener error http://example.com/ provided port:55000 is not reachable
java.io.IOException: http://example.com/ provided port:55000 is not reachable
        at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
        at hudson.remoting.Engine.innerRun(Engine.java:523)
        at hudson.remoting.Engine.run(Engine.java:474)

At this point, the Swarm client exited (with an unknown exit code) and never recovered. The higher-level Jenkins jobs failed.

Background

Of note is that my masters have a custom JNLP port, which I set from a Groovy initialization script like so:

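import jenkins.model.Jenkins

// Read the desired agent (JNLP) port from the environment.
def env = System.getenv()
int port = env['JENKINS_SLAVE_AGENT_PORT'].toInteger()
// Only write the configuration when the port actually changes.
if (Jenkins.get().slaveAgentPort != port) {
  Jenkins.get().slaveAgentPort = port
  Jenkins.get().save()
}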

So for a short period of time during Jenkins initialization, before this Groovy initialization script has run, Jenkins is up (and therefore replying to HTTP connection requests) but the JNLP port settings haven't been applied yet (so a connection to the JNLP port would fail).

Analysis

Here is my analysis of the situation:

The Jenkins master went down, then started initializing again.
Swarm client successfully communicated with the master via HTTP to create the new agent (see the hudson.plugins.swarm.SwarmClient createSwarmSlave line above, which clearly shows the Swarm client was able to successfully communicate with the master over HTTP).
Swarm client then delegated to Remoting, calling hudson.remoting.jnlp.Main#main, which called _main, which called another main, which called Engine#startEngine. We know this because we see the "Using Remoting version" and "Using standard File System JAR Cache" lines above.
Engine#startEngine started a thread, which invoked Engine#run. We know this because the org.jenkinsci.remoting.protocol.IOHub create log line was printed.
Engine#run got all the way into Engine#innerRun, which got as far as the endpoint = resolver.resolve() call on line 523. We know this because the log statement "Locating server among..." was printed.
In JnlpAgentEndpointResolver#resolve, we successfully made an HTTP call to the server to list the available protocols. Again, we know this because the log statement "Remoting server accepts the following protocols" was printed.
In JnlpAgentEndpointResolver#resolve, we call isPortVisible, and here is where things go haywire. At this point, the JNLP port is not available yet, even though the server is responding to HTTP requests, presumably because my Groovy initialization script hasn't run yet. We get the error "http://example.com/ provided port:55000 is not reachable" from JnlpAgentEndpointResolver#resolve.
isPortVisible returns false to JnlpAgentEndpointResolver#resolve, which sets firstError to a new IOException, then continues. We have nothing else to loop through, so we get to the bottom of the method and throw firstError, which in this case is the IOException.
The caller of JnlpAgentEndpointResolver#resolve, Engine#innerRun, catches the exception and returns.
Back in Engine#run, innerRun returns and then run returns. At this point the thread dies. We pop the stack all the way back up to Main#main and ultimately back to the Swarm Client itself, which exits.
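
To make the failing step concrete, here is a minimal sketch of the two checks described above: reading the advertised agent port from the HTTP response headers, then testing whether that port accepts TCP connections. This is not the Remoting source. The class name AgentPortProbe and both method names are hypothetical. Only the header names (X-Jenkins-JNLP-Port, X-Hudson-JNLP-Port) come from the log above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper for illustration; not part of Remoting or the Swarm client.
final class AgentPortProbe {

    // Extracts the advertised agent port from HTTP response headers, preferring
    // X-Jenkins-JNLP-Port and falling back to X-Hudson-JNLP-Port.
    static OptionalInt advertisedPort(Map<String, List<String>> headers) {
        for (String name : new String[] {"X-Jenkins-JNLP-Port", "X-Hudson-JNLP-Port"}) {
            for (Map.Entry<String, List<String>> e : headers.entrySet()) {
                // HttpURLConnection stores the status line under a null key.
                if (e.getKey() == null || !e.getKey().equalsIgnoreCase(name)) {
                    continue;
                }
                if (e.getValue() == null || e.getValue().isEmpty()) {
                    continue;
                }
                try {
                    return OptionalInt.of(Integer.parseInt(e.getValue().get(0).trim()));
                } catch (NumberFormatException ignored) {
                    // Malformed value: keep looking at the remaining headers.
                }
            }
        }
        return OptionalInt.empty();
    }

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    // During the race described above, the HTTP request succeeds but this returns false.
    static boolean isPortReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

In the failure above, the equivalent of advertisedPort would yield 55000 while the equivalent of isPortReachable would return false, because the HTTP listener came up before the agent port was configured.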

Possible Solutions

Clearly, this is a suboptimal outcome. (In practice, it took down a bunch of my test automation tonight.) What are your thoughts on how this problem could be solved? Here are some of mine.

Ideally, Jenkins core would not respond to HTTP requests until the JNLP port is available. Unfortunately, I don't see a practical way to make this a reality. There doesn't seem to be a way to set the JNLP port early on in Jenkins startup today (hence the need for my Groovy initialization script). I'm not sure whether or not it's feasible to try and add such an option. Even if it were feasible, I still don't know enough about Jenkins early initialization to be able to guarantee that it would close the race. This seems like the ideal solution in the long term, but it's quite impractical for the short or medium term.
Could we have Remoting try a little harder, knowing that there is a race between the Jenkins master being available via HTTP and JNLP? In practice this race is very small, and I rarely hit it. Retrying up to a minute or so, with a bit of backoff along the way, might be "good enough". Would this be a direction you want to go in? This option appeals to me because it seems more realistic to implement, and it would also benefit non-Swarm JNLP clients. This seems to be the best medium-term solution.
Should we have Swarm somehow detect this condition and re-invoke hudson.remoting.jnlp.Main#main? Swarm already has command-line options for retries, so we could take advantage of one of those to try and restart the JNLP client if the thread dies for some reason. This seems a bit sub-optimal, since it would only benefit Swarm clients and not other JNLP clients. But it could be done as a short-term solution.
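
As a rough illustration of the "retry up to a minute or so, with a bit of backoff" idea, here is a sketch of a bounded retry loop with exponential backoff. It is not taken from Remoting or Swarm. The class name EndpointRetry, the method waitFor, and all of its parameters are hypothetical, and the probe stands in for whatever reachability check the caller uses.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of Remoting or the Swarm client.
final class EndpointRetry {

    // Calls probe repeatedly until it returns true or the timeout elapses.
    // The delay between attempts doubles each time, capped at maxDelay.
    // Returns true if the probe succeeded, false on timeout or interruption.
    static boolean waitFor(BooleanSupplier probe,
                           Duration initialDelay,
                           Duration maxDelay,
                           Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long delayMillis = initialDelay.toMillis();
        while (true) {
            if (probe.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(delayMillis, remainingMillis));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
            delayMillis = Math.min(delayMillis * 2, maxDelay.toMillis());
        }
    }
}
```

A caller could pass a probe that checks the agent port, with, say, a one-second initial delay, a ten-second cap, and a one-minute timeout. The same loop shape would apply whether the retry lives in Remoting (the second option) or in Swarm around the call into Remoting (the third option).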

Let me know what you think about my analysis and these possible solutions. I'd be happy to collaborate with you to get this fixed.

links to

jenkinsci/remoting#449

jenkinsci/swarm-plugin#321
