Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-49816

swarm node says connected succesffuly, but master has placed it offline

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Jenkins ver. 2.89.4
      Swarm 3.9
    • Similar Issues:

      Description

      We spin up 1000's of nodes with swarm per month.

      Every month we encounter a few scenarios where the swarm agent says it connected successfully, but the jenkins master does not show it.

      The node has these logs (notice it does not say "INFO: Connected", which it usually does):

      Swarm Logs

      INFO: Client.main invoked with: [-name eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be -description Created by Swarm. InstanceID=i-03918a0ef1ef6d8be AmiId=ami-a030b2d8 -executors 1 -fsroot /mnt/ope/ws -labels eod-us-west-2_spot_m3.xlarge -master https://jenkins.clearcare.it/ -mode normal -retry 30 -username sre@clearcareonline.com -password nJ0yuLYBcOJE -disableSslVerification]
      Feb 28, 2018 7:49:57 PM hudson.plugins.swarm.Client run
      INFO: Discovering Jenkins master
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.Client run
      INFO: Attempting to connect to https://jenkins.clearcare.it/ ea7ab441-78d0-4548-a571-5feaae0be121 with ID fd8127ce
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
      SEVERE: Could not obtain CSRF crumb. Response code: 404
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among https://jenkins.foo.it/
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins.foo.it
      Agent port: 30001
      Identity: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.foo.it:30001
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec

      On the master logs, I see this:
      WARNING: Making eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce offline because it’s not responding

      Restarting the java process does the trick, but I hate manually doing this.
      It seems the swarm jar gets stuck after the log, "Remote identity confirmed".

      Again, out of 1000 times a month, this issue occurs maybe 2-4 times.

        Attachments

          Activity

          Hide
          jthompson Jeff Thompson added a comment -

          I have no idea what's going on here and I haven't seen any other similar reports. Unfortunately there isn't enough information in the logs to figure anything out. It might be possible to figure more out with finer-grained logging but that would make it difficult to sift through it all to find something interesting.

          As Alex Gray notes it's common enough to be a nuisance but not frequent enough to be a big problem. Without more info there's not much to take action on. I don't know there's much reason to hold this ticket open.

          Show
          jthompson Jeff Thompson added a comment - I have no idea what's going on here and I haven't seen any other similar reports. Unfortunately there isn't enough information in the logs to figure anything out. It might be possible to figure more out with finer-grained logging but that would make it difficult to sift through it all to find something interesting. As Alex Gray notes it's common enough to be a nuisance but not frequent enough to be a big problem. Without more info there's not much to take action on. I don't know there's much reason to hold this ticket open.
          Hide
          oleg_nenashev Oleg Nenashev added a comment -

          I would argue it is a Remoting issue then, nothing specific to Swarm plugin AFAIK.
          Maybe Jeff Thompson has seen it

          Show
          oleg_nenashev Oleg Nenashev added a comment - I would argue it is a Remoting issue then, nothing specific to Swarm plugin AFAIK. Maybe Jeff Thompson has seen it
          Hide
          grayaii Alex Gray added a comment -

          I was able to reproduce this locally by running a script that basically:
          1.  runs the the "java -jar swarm ..." command to connect the slave the master

          2.  verifies it was connected to the master

          Out of a few thousand loops, it fails to connect (same logs I posted originally... basically the "INFO: Connected" is not present in the logs.)  Maybe if this failed 1/10 times this can be considered a bug, but 1/1000 probably not.

          Show
          grayaii Alex Gray added a comment - I was able to reproduce this locally by running a script that basically: 1.  runs the the "java -jar swarm ..." command to connect the slave the master 2.  verifies it was connected to the master Out of a few thousand loops, it fails to connect (same logs I posted originally... basically the "INFO: Connected" is not present in the logs.)  Maybe if this failed 1/10 times this can be considered a bug, but 1/1000 probably not.
          Hide
          oleg_nenashev Oleg Nenashev added a comment -

          Well, with the current info I cannot say whether it is a defect or not.

          Show
          oleg_nenashev Oleg Nenashev added a comment - Well, with the current info I cannot say whether it is a defect or not.
          Hide
          grayaii Alex Gray added a comment -

          Unfortunately, those are the only logs.  I am spinning up pretty hefty AWS spotinstances (c4.xlarge) when this happens.  Out of 1000 spin ups from the same AMI, I get a few of these failures.  Most of our "regular" nodes are t2.small and they don't have any issues, but we only spin up a few of those per month, so it's probably not related to any type of load.  We also have newrelic metrics and there is no load when this is happening.  The workaround is pretty easy:  I have a cron that runs every 5 minutes and if the last line in the log is not "INFO: Connected" after a few minutes, I restart the swarm jar and it works.  I would rate this ticket as "low priority", since it's easy to detect and easy to workaround.

          Show
          grayaii Alex Gray added a comment - Unfortunately, those are the only logs.  I am spinning up pretty hefty AWS spotinstances (c4.xlarge) when this happens.  Out of 1000 spin ups from the same AMI, I get a few of these failures.  Most of our "regular" nodes are t2.small and they don't have any issues, but we only spin up a few of those per month, so it's probably not related to any type of load.  We also have newrelic metrics and there is no load when this is happening.  The workaround is pretty easy:  I have a cron that runs every 5 minutes and if the last line in the log is not "INFO: Connected" after a few minutes, I restart the swarm jar and it works.  I would rate this ticket as "low priority", since it's easy to detect and easy to workaround.
          Hide
          oleg_nenashev Oleg Nenashev added a comment -

          Is there anything else in the master log? Maybe the agent becomes unavailable due to the heavy classloading or so

          Show
          oleg_nenashev Oleg Nenashev added a comment - Is there anything else in the master log? Maybe the agent becomes unavailable due to the heavy classloading or so

            People

            • Assignee:
              jthompson Jeff Thompson
              Reporter:
              grayaii Alex Gray
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: