Hi, I'm the new maintainer of the Swarm Plugin. I encountered an issue tonight after doing a routine restart of a Jenkins master (to perform a plugin update) that resulted in all my Swarm clients losing their connection to that master (but not to my other masters). The details are below; I'd welcome your thoughts on my root cause analysis, and I'd be happy to collaborate on a solution with you.
Typically, my Swarm clients reconnect just fine after a master restarts, thanks to my use of the Swarm client's -deleteExistingClients feature. In fact, I even have a unit test for this functionality. And tonight, the Swarm clients successfully reconnected when all of my Jenkins masters were restarted, except for one: on that single master (but not the others), every Swarm client failed to reconnect. The Swarm client logs on all the failed clients showed messages like the following:
At this point, the Swarm client exited (with an unknown exit code) and never recovered. The higher-level Jenkins jobs failed.
Of note is that my masters use a custom JNLP port, which I set from a Groovy initialization script, essentially like so (simplified here; port 55000 matches the error below):
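```groovy
// $JENKINS_HOME/init.groovy.d/set-jnlp-port.groovy (simplified)
import jenkins.model.Jenkins

// 55000 is the same port that shows up in the "not reachable" error below.
Jenkins.instance.setSlaveAgentPort(55000)
Jenkins.instance.save()
```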
So for a short period during Jenkins initialization, before this Groovy initialization script has run, Jenkins is up (and therefore replying to HTTP connection requests), but the JNLP port settings haven't been applied yet, so a connection to the JNLP port fails.
Here is my analysis of the situation:
- The Jenkins master went down, then started initializing again.
- The Swarm client successfully communicated with the master via HTTP to create the new agent (see the hudson.plugins.swarm.SwarmClient createSwarmSlave line above).
- The Swarm client then delegated to Remoting, calling hudson.remoting.jnlp.Main#main, which called _main, which in turn called another main method, which called Engine#startEngine. We know this because we see the "Using Remoting version" and "Using custom JAR Cache" lines above.
- Engine#startEngine started a thread, which invoked Engine#run. We know this because the org.jenkinsci.remoting.protocol.IOHub create log line was printed.
- Engine#run proceeded into Engine#innerRun, which got as far as the endpoint = resolver.resolve() call on line 523. We know this because the "Locating server among..." log statement was printed.
- In JnlpAgentEndpointResolver#resolve, we successfully made an HTTP call to the server to list the available protocols. Again, we know this because the log statement "Remoting server accepts the following protocols" was printed.
- In JnlpAgentEndpointResolver#resolve, we then call isPortVisible, and here is where things go haywire. At this point, the JNLP port is not available yet, even though the server is responding to HTTP requests, presumably because my Groovy initialization script hasn't run yet. We get the error "http://example.com/ provided port:55000 is not reachable" from JnlpAgentEndpointResolver#resolve. (A sketch of what such a port probe boils down to follows this list.)
- isPortVisible returns false to JnlpAgentEndpointResolver#resolve, which sets firstError to a new IOException, then continues. We have nothing else to loop through, so we get to the bottom of the method and throw firstError, which in this case is the IOException.
- The caller of JnlpAgentEndpointResolver#resolve, Engine#innerRun, catches the exception and returns.
- Back in Engine#run, innerRun returns and then run returns. At this point the thread dies. We pop the stack all the way back up to Main#main and ultimately back to the Swarm Client itself, which exits.
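For clarity, here is roughly what a port-visibility probe like this boils down to: a plain TCP connect with a timeout. This is a hypothetical stand-in I wrote for illustration (the class and parameter names are mine), not Remoting's actual isPortVisible source:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical stand-in for a port-visibility probe, for illustration only.
public final class PortProbe {
    public static boolean isPortVisible(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: in the race window above,
            // Jenkins answers HTTP but nothing is listening on the JNLP port.
            return false;
        }
    }
}
```

During the race window, an HTTP request to the master succeeds while this probe fails, which is exactly the combination the log excerpt shows.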
Clearly, this is a suboptimal outcome. (In practice, it took down a bunch of my test automation tonight.) What are your thoughts on how this problem could be solved? Here are some of mine.
- Ideally, Jenkins core would not respond to HTTP requests until the JNLP port is available. Unfortunately, I don't see a practical way to make this a reality. There doesn't seem to be a way to set the JNLP port early in Jenkins startup today (hence the need for my Groovy initialization script). I'm not sure whether it's feasible to add such an option, and even if it were, I don't know enough about Jenkins early initialization to guarantee that it would close the race. This seems like the ideal solution in the long term, but it's quite impractical for the short or medium term.
- Could we have Remoting try a little harder, knowing that there is a race between the Jenkins master becoming available via HTTP and via JNLP? In practice this race window is very small, and I rarely hit it. Retrying for up to a minute or so, with a bit of backoff along the way, might be "good enough" (see the first sketch after this list). Would this be a direction you want to go in? This option appeals to me because it seems more realistic to implement, and it would also benefit non-Swarm JNLP clients. It seems like the best medium-term solution.
- Should we have Swarm somehow detect this condition and re-invoke hudson.remoting.jnlp.Main#main? Swarm already has command-line options for retries, so we could take advantage of one of those to restart the JNLP client if the thread dies for some reason (see the second sketch after this list). This seems a bit suboptimal, since it would only benefit Swarm clients and not other JNLP clients, but it could be done as a short-term solution.
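To make the second option concrete, here is a hypothetical sketch of what "try a little harder" could look like: wrap the port check in a bounded retry loop with backoff. It reuses the PortProbe stand-in from above; the names, timeouts, and structure are mine, not Remoting's:

```java
import java.io.IOException;

// Hypothetical sketch of option 2: retry the JNLP port check for up to a
// minute with capped exponential backoff before giving up. Names, timeouts,
// and structure are illustrative, not Remoting's actual code.
public final class PortRetry {
    public static void waitForPort(String host, int port)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + 60_000; // ~1 minute budget
        long delayMs = 1_000;                                 // initial backoff
        while (!PortProbe.isPortVisible(host, port, 5_000)) {
            if (System.currentTimeMillis() >= deadline) {
                // Same terminal behavior as today, just after a grace period.
                throw new IOException(host + " provided port:" + port + " is not reachable");
            }
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, 10_000); // back off, capped at 10s
        }
    }
}
```

The key property is that a transient failure during the initialization window no longer becomes an immediately fatal IOException.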
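Similarly, a hypothetical sketch of the third option: a Swarm-side wrapper that re-invokes the Remoting entry point when it returns prematurely (in the failure above, Main#main returns normally after the engine thread dies). The retry budget would come from Swarm's existing retry options; the wrapper itself is mine, not current Swarm code:

```java
// Hypothetical sketch of option 3. Assumes remoting.jar is on the classpath.
public final class JnlpRetryWrapper {
    public static void runWithRetries(String[] jnlpArgs, int maxRetries)
            throws Exception {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            // Main#main normally blocks for the life of the connection, so a
            // return here means the engine thread died (as in the race above).
            hudson.remoting.jnlp.Main.main(jnlpArgs);
            Thread.sleep(5_000L * attempt); // back off before restarting
        }
        // A real implementation would also need to distinguish a deliberate
        // shutdown from a premature return before deciding to restart.
    }
}
```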
Let me know what you think about my analysis and these possible solutions. I'd be happy to collaborate with you to get this fixed.