JENKINS-72163

Retry on initial connection failure occurs in one entrypoint but not the other

    • 3160.vd76b_9ddd10cc

      Problem

      I talked to a user running Kubernetes agents in a cluster where the controller was not immediately reachable over the network after the agent was spun up; it took about 30 seconds for the controller to become reachable. While admitting that this networking setup was not ideal, the user expected Remoting to be resilient to this scenario, but it was not. Instead, Remoting printed the following exception and then terminated without trying again:

      java.io.IOException: Failed to connect to http://example.com/tcpSlaveAgentListener/: Connection refused
      	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:216)
      	at hudson.remoting.Engine.innerRun(Engine.java:761)
      	at hudson.remoting.Engine.run(Engine.java:543)
      Caused by: java.net.ConnectException: Connection refused
      	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
      	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
      	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
      	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
      	at java.base/java.net.Socket.connect(Socket.java:633)
      	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:178)
      	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:533)
      	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:638)
      	at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:281)
      	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:386)
      	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:408)
      	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1309)
      	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242)
      	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128)
      	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1057)
      	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:213)
      	... 2 more
      

      Evaluation

      There are two public static void main() entrypoints into Remoting: hudson.remoting.Launcher (used by java -jar remoting.jar -jnlpUrl […]) and hudson.remoting.jnlp.Main (used by java -cp remoting.jar hudson.remoting.jnlp.Main […] -url […], which was the entrypoint this user was using). When -jnlpUrl is passed in, the first is a thin wrapper around the second: if the controller is not available, it retries every 10 seconds (unless -noReconnect is specified) until the controller becomes available, and only then vectors into the second entrypoint. If the connection is interrupted after it is established, we again retry every 10 seconds (unless -noReconnect is specified). But there is a gap in retry coverage: if the second entrypoint is invoked directly (rather than via the first), as was the case with this user, and the controller is not available when the initial connection is attempted, no retries occur.
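      To illustrate the kind of retry that would close this gap, here is a minimal, self-contained sketch. It is not Remoting's actual code: the class InitialConnectRetry, the method retryUntilConnected, and its parameters are hypothetical names chosen for this example, and the endpoint lookup is simulated by a callable that fails twice with "Connection refused" before succeeding. The interval parameter stands in for the 10-second delay described above, and noReconnect stands in for the -noReconnect option.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Remoting.
final class InitialConnectRetry {

    /**
     * Calls {@code attempt} until it succeeds. On an IOException (for example
     * "Connection refused"), sleeps for {@code interval} and tries again,
     * unless {@code noReconnect} is set, in which case the failure is rethrown.
     */
    static <T> T retryUntilConnected(Callable<T> attempt, Duration interval, boolean noReconnect)
            throws Exception {
        while (true) {
            try {
                return attempt.call();
            } catch (IOException e) {
                if (noReconnect) {
                    throw e; // caller asked for no retries: fail on the first error
                }
                System.err.println("Initial connection failed (" + e.getMessage()
                        + "); retrying in " + interval.toMillis() + " ms");
                Thread.sleep(interval.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated endpoint lookup: unreachable for the first two attempts.
        String result = retryUntilConnected(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection refused");
            }
            return "connected";
        }, Duration.ofMillis(10), false); // a real agent would wait about 10 seconds
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

      Running main prints "connected after 3 attempts". With noReconnect set to true, the first ConnectException would propagate instead, which mirrors the behavior the user saw when invoking the second entrypoint directly.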

            basil Basil Crow