JENKINS-57831

Remoting susceptible to race between HTTP availability and JNLP availability during master initialization

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Jenkins 2.150.1
      Swarm client 3.17 (Remoting 3.30)

      Description

      Hi, I'm the new maintainer of the Swarm Plugin. I encountered an issue tonight after doing a routine restart of a Jenkins master (to perform a plugin update): all my Swarm clients lost their connection to that master (but not to my other masters). The details are below. I'd welcome your thoughts on my root cause analysis, and I'd be happy to collaborate on a solution with you.

      Problem

      Typically, my Swarm clients reconnect just fine after a master restarts, because I use the Swarm client's -deleteExistingClients feature; I even have a unit test for this functionality. Tonight, when all of my Jenkins masters were restarted, the Swarm clients successfully reconnected to every master except one. On that single master, all the Swarm clients failed to reconnect. The Swarm client logs on all the failed clients showed messages like the following:

      2019-06-04 03:08:24 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://example.com/ to configure swarm client.
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.Candidate <init> Candidate constructed with url: http://example.com/, secret: <redacted>
      2019-06-04 03:08:24 INFO hudson.plugins.swarm.Client run Attempting to connect to http://example.com/ <redacted> with ID 
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createSwarmSlave createSwarmSlave() invoked
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
      2019-06-04 03:08:24 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404
      2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient connect connect() invoked
      2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4973813a6 pairs: {GET /computer/vm.example.com/slave-agent.jnlp HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
      2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@6321e8137 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: application/x-java-jnlp-file}{Content-Length: 772}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{Server: Jetty(9.4.z-SNAPSHOT)}
      2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: vm.example.com
      2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode.
      2019-06-04 03:08:24 INFO hudson.remoting.Engine startEngine Using Remoting version: 3.30
      2019-06-04 03:08:24 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars
      2019-06-04 03:08:24 FINE hudson.remoting.Engine startEngine Using standard File System JAR Cache. Root Directory is /var/tmp/jenkins/.jenkins/cache/jars
      2019-06-04 03:08:24 FINE org.jenkinsci.remoting.protocol.IOHub create Staring an additional Selector wakeup thread. See JENKINS-47965 for more info
      2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://example.com/]
      2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4ae378ec6 pairs: {GET /tcpSlaveAgentListener/ HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
      2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@fe1efd411 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: text/plain;charset=utf-8}{Content-Length: 12}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{X-Hudson-JNLP-Port: 55000}{X-Jenkins-JNLP-Port: 55000}{X-Instance-Identity: <redacted>}{X-Jenkins-Agent-Protocols: JNLP4-connect, Ping}{Server: Jetty(9.4.z-SNAPSHOT)}
      2019-06-04 03:08:24 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      2019-06-04 03:08:24 WARNING org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible Connection refused (Connection refused)
      2019-06-04 03:08:24 SEVERE hudson.remoting.jnlp.Main$CuiListener error http://example.com/ provided port:55000 is not reachable
      java.io.IOException: http://example.com/ provided port:55000 is not reachable
              at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
              at hudson.remoting.Engine.innerRun(Engine.java:523)
              at hudson.remoting.Engine.run(Engine.java:474)
      

      At this point, the Swarm client exited (with an unknown exit code) and never recovered. The higher-level Jenkins jobs failed.

      Background

      Of note is that my masters have a custom JNLP port, which I set from a Groovy initialization script like so:

      import jenkins.model.Jenkins

      // Groovy initialization script (e.g. placed in $JENKINS_HOME/init.groovy.d/)
      // that applies the custom JNLP agent port after Jenkins starts.
      def env = System.getenv()
      int port = env['JENKINS_SLAVE_AGENT_PORT'].toInteger()
      if (Jenkins.get().slaveAgentPort != port) {
        Jenkins.get().slaveAgentPort = port
        Jenkins.get().save()
      }
      

      So for a short period of time during Jenkins initialization, before this Groovy initialization script has run, Jenkins is up (and therefore replying to HTTP connection requests) but the JNLP port setting hasn't been applied yet (so a connection to the JNLP port would fail).

      Analysis

      Here is my analysis of the situation:

      1. The Jenkins master went down, then started initializing again.
      2. The Swarm client successfully communicated with the master via HTTP to create the new agent (see the hudson.plugins.swarm.SwarmClient createSwarmSlave line above).
      3. The Swarm client then delegated to Remoting, calling hudson.remoting.jnlp.Main#main, which called _main, which called another main overload, which called Engine#startEngine. We know this because we see the "Using Remoting version" and the JAR Cache lines above.
      4. Engine#startEngine started a thread, which invoked Engine#run. We know this because the org.jenkinsci.remoting.protocol.IOHub create log line was printed.
      5. Engine#run got all the way into Engine#innerRun, which got as far as the endpoint = resolver.resolve() call on line 523. We know this because the log statement "Locating server among..." was printed.
      6. In JnlpAgentEndpointResolver#resolve, we successfully made an HTTP call to the server to list the available protocols. Again, we know this because the log statement "Remoting server accepts the following protocols" was printed.
      7. In JnlpAgentEndpointResolver#resolve, we call isPortVisible, and here is where things go haywire. At this point, the JNLP port is not available yet, even though the server is responding to HTTP requests, presumably because my Groovy initialization script hasn't run yet. (A simplified sketch of this kind of reachability check appears after this list.) We get the error http://example.com/ provided port:55000 is not reachable from JnlpAgentEndpointResolver#resolve.
      8. isPortVisible returns false to JnlpAgentEndpointResolver#resolve, which sets firstError to a new IOException, then continues. We have nothing else to loop through, so we get to the bottom of the method and throw firstError, which in this case is the IOException.
      9. The caller of JnlpAgentEndpointResolver#resolve, Engine#innerRun, catches the exception and returns.
      10. Back in Engine#run, innerRun returns and then run returns. At this point the thread dies. We pop the stack all the way back up to Main#main and ultimately back to the Swarm Client itself, which exits.
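
      To illustrate step 7, here is a minimal, self-contained Java sketch (not the actual Remoting source) of the kind of TCP reachability probe that isPortVisible performs. During the window described above, the HTTP port answers but this probe fails with "Connection refused" because nothing is listening on the JNLP port yet:

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import java.net.Socket;

      public class PortProbe {
          // Returns true only if a TCP connection to host:port succeeds within the timeout.
          static boolean isPortVisible(String host, int port, int timeoutMs) {
              try (Socket socket = new Socket()) {
                  socket.connect(new InetSocketAddress(host, port), timeoutMs);
                  return true;
              } catch (IOException e) {
                  // e.g. java.net.ConnectException: Connection refused during the race window
                  return false;
              }
          }

          public static void main(String[] args) {
              // Host and port taken from this report; the 5-second timeout is arbitrary.
              System.out.println(isPortVisible("example.com", 55000, 5000));
          }
      }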

      Possible Solutions

      Clearly, this is a suboptimal outcome. (In practice, it took down a bunch of my test automation tonight.) What are your thoughts on how this problem could be solved? Here are some of mine.

      • Ideally, Jenkins core would not respond to HTTP requests until the JNLP port is available. Unfortunately, I don't see a practical way to make that a reality. There doesn't seem to be a way to set the JNLP port early in Jenkins startup today (hence the need for my Groovy initialization script), and I'm not sure whether it's feasible to add such an option. Even if it were feasible, I still don't know enough about Jenkins early initialization to guarantee that it would close the race. This seems like the ideal solution in the long term, but it's quite impractical for the short or medium term.
      • Could we have Remoting try a little harder, knowing that there is a race between the Jenkins master becoming available via HTTP and via JNLP? In practice this race window is very small, and I rarely hit it. Retrying for up to a minute or so, with a bit of backoff along the way, might be "good enough" (see the sketch after this list). Would this be a direction you want to go in? This option appeals to me because it seems more realistic to implement, and it would also benefit non-Swarm JNLP clients. This seems to be the best medium-term solution.
      • Should we have Swarm somehow detect this condition and re-invoke hudson.remoting.jnlp.Main#main? Swarm already has command-line options for retries, so we could take advantage of one of those to restart the JNLP client if the thread dies for some reason. This seems a bit suboptimal, since it would only benefit Swarm clients and not other JNLP clients, but it could be done as a short-term solution.

      Let me know what you think about my analysis and these possible solutions. I'd be happy to collaborate with you to get this fixed.


          Activity

          Basil Crow added a comment -

          This seems to be somewhat the opposite of what Javier Delgado was working on in JENKINS-46515, Remoting PR #193 (https://github.com/jenkinsci/remoting/pull/193). In that case he was trying to exit the agent process more quickly so that things kick off again, whereas you are trying to get it not to exit so quickly.

          Swarm has a number of retry-related options:

           -noRetryAfterConnected                 : Do not retry if a successful
                                                    connection gets closed.
           -retry N                               : Number of retries before giving up.
                                                    Unlimited if not specified.
           -retryBackOffStrategy RETRY_BACK_OFF_STRATEGY
                                                  : The mode controlling retry wait time.
                                                    Can be either 'none' (use same
                                                    interval between retries) or 'linear'
                                                    (increase wait time before each retry
                                                    up to maxRetryInterval) or
                                                    'exponential' (double wait interval
                                                    on each retry up to maxRetryInterval).
                                                    Default is 'none'.
           -retryInterval N                       : Time to wait before retry in seconds.
                                                    Default is 10 seconds.
          

          When these options are being used, Swarm wants the process to keep running, so Remoting's use of System.exit is problematic. In this case, Swarm really wants Remoting to pass control back to Swarm on failure so that Swarm can retry. Today this takes place mostly by popping the stack back up to Remoting's main() so that Swarm can call Remoting's main again. But this interface is an implementation detail of Swarm/Remoting and we could always redefine that interface (for example, to use some specific exception type) if desired.
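
          For example, a dedicated exception type could carry the failure back to Swarm instead of a System.exit deep inside Remoting, letting Swarm drive its existing -retry/-retryInterval logic. The sketch below is purely hypothetical; none of these class or method names exist today in either codebase.

              // Hypothetical: a dedicated exception Remoting could throw instead of exiting.
              class AgentConnectionFailedException extends Exception {
                  AgentConnectionFailedException(String message, Throwable cause) {
                      super(message, cause);
                  }
              }

              class SwarmRetryLoop {
                  // retries < 0 means "retry forever", mirroring Swarm's unlimited default.
                  void run(int retries, long retryIntervalMs) throws InterruptedException {
                      for (int attempt = 0; retries < 0 || attempt <= retries; attempt++) {
                          try {
                              launchRemoting(); // stand-in for hudson.remoting.jnlp.Main#main
                              return;
                          } catch (AgentConnectionFailedException e) {
                              Thread.sleep(retryIntervalMs); // wait, then let the loop retry
                          }
                      }
                  }

                  private void launchRemoting() throws AgentConnectionFailedException {
                      // Would delegate to Remoting and translate connection failures into
                      // the dedicated exception above instead of terminating the JVM.
                  }
              }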

          Javier's use case seems a bit different; it seems more aligned with a situation where you have a service manager (e.g. systemd) that is monitoring the process and restarting it on failure, e.g.:

                          [Unit]
                          Description=Swarm client
                          Requires=network.target
                          After=local-fs.target
                          After=network.target
          
                          [Service]
                          Type=simple
                          WorkingDirectory=/var/lib/swarm-client
                          ExecStart=java -jar swarm-client.jar [...] 
                          Restart=on-failure
                          RestartSec=5
          
                          #
                          # If the Swarm client loses its connection to the master and
                          # needs to be restarted, we don't want to interrupt its child
                          # processes, which the new Swarm client process will find when
                          # it resumes its connection.
                          #
                          KillMode=process
          
                          #
                          # The JVM's exit status when it receives a SIGTERM (15) will be
                          # 128+15=143, so consider that a clean exit.
                          #
                          SuccessExitStatus=143
          
                          [Install]
                          WantedBy=multi-user.target
          

          With the above type of setup, you'd be relying on the client to exit non-zero so that systemd could restart it.

          I think ultimately the current semantics are a bit muddy and could use some cleaning up. This could be done while preserving backwards compatibility by adding new flags to Remoting/Swarm. Here's a rough sketch of what the new semantics could look like in the long term:

          • "Exit non-zero on failure": Add a new flag that means "fail fast and don't try to keep the process going". Indicate failure with a non-zero exit status and rely on the caller to retry. The caller would be expected to be something like systemd that has its own retry mechanism built-in. Ideally the flag would be named the same for both Swarm and Remoting. If the flag is passed to Swarm, Swarm would plumb it through to Remoting.
          • We'd want the opposite of the above flag, either as its own flag (i.e. "keep trying") or as the default when no flags are specified. Currently things are "mostly" this way by default, with the exception of the harsh System.exit I pointed out above. We could change the default behavior there (and risk some breakage), or have a truly no-risk solution in the form of a new "keep trying" flag. If we do the latter, there would be three modes: "fail fast" (new flag), "keep trying" (new flag), and default (i.e. maintain the existing confusing semantics); a tiny sketch of these modes follows this list.
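
          For illustration only, the three modes could be modeled along these lines (no such flags exist today):

              // Hypothetical sketch of the three modes discussed above.
              enum FailureMode {
                  FAIL_FAST,   // new flag: exit non-zero immediately and let e.g. systemd retry
                  KEEP_TRYING, // new flag: never exit on connection failure; retry internally
                  LEGACY       // default: preserve today's mixed behavior for compatibility
              }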

          I don't see a need to implement any of this right away given that Pedro's JNLP port solution is probably going to chase away the problem for me in the short term. But it's worth keeping this in mind as a possible design direction in the long term.

          Jeff Thompson added a comment -

          You've described the scenarios and status well. As you note, if the system property for the JNLP port works, there probably isn't much of a need to make a change at this point. I'll keep this in mind and see if any more information or needs come up. I think any change we might make would have to preserve the existing default behavior; it's just too complicated to predict who might be relying on what. If Javier Delgado still has an identified need for the "exit non-zero on failure" behavior, we could try again to implement that without disruption.

          Jeff Thompson added a comment -

          Basil Crow, have you gotten any further in validating that your needs are met by existing functionality?

          Basil Crow added a comment -

          I deployed Pedro's changes from jenkinsci/docker#805 on Monday, and since then I've done one Jenkins master restart which went off without a hitch. So far, use of the jenkins.model.Jenkins.slaveAgentPort Java property seems to have chased away the problem.
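
          For reference, that property is set on the master's JVM at startup rather than from an init script; an invocation might look something like this (port number taken from this report):

              java -Djenkins.model.Jenkins.slaveAgentPort=55000 -jar jenkins.war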

          Jeff Thompson added a comment -

          Great. It looks like the possible value of making any changes at this time has declined. I'll go ahead and close this issue and the related PR. We'll see if another scenario comes up that could be the driver for some init sequence changes.


            People

            • Assignee:
              Jeff Thompson
              Reporter:
              Basil Crow
            • Votes:
              0
              Watchers:
              2
