  1. Jenkins
  2. JENKINS-49816

Swarm node says it connected successfully, but master has placed it offline

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Jenkins ver. 2.89.4
      Swarm 3.9

      Description

      We spin up 1000s of nodes with Swarm every month.

      Every month we hit a few cases where the Swarm agent reports that it connected successfully, but the Jenkins master does not show it as online.

      The node has these logs (notice it does not say "INFO: Connected", which it usually does):

      Swarm Logs

      INFO: Client.main invoked with: [-name eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be -description Created by Swarm. InstanceID=i-03918a0ef1ef6d8be AmiId=ami-a030b2d8 -executors 1 -fsroot /mnt/ope/ws -labels eod-us-west-2_spot_m3.xlarge -master https://jenkins.clearcare.it/ -mode normal -retry 30 -username sre@clearcareonline.com -password nJ0yuLYBcOJE -disableSslVerification]
      Feb 28, 2018 7:49:57 PM hudson.plugins.swarm.Client run
      INFO: Discovering Jenkins master
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.Client run
      INFO: Attempting to connect to https://jenkins.clearcare.it/ ea7ab441-78d0-4548-a571-5feaae0be121 with ID fd8127ce
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
      SEVERE: Could not obtain CSRF crumb. Response code: 404
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among https://jenkins.foo.it/
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins.foo.it
      Agent port: 30001
      Identity: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.foo.it:30001
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec

      In the master logs, I see this:
      WARNING: Making eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce offline because it’s not responding
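
      A minimal sketch of how this state could be spotted from the master side. It assumes the node list has already been fetched from Jenkins' /computer/api/json endpoint and that each entry carries displayName and offline fields. The function name, the prefix filter, and any sample payload are illustrative and not part of this report.

```python
import json


def offline_agents(payload, name_prefix=""):
    """Return display names of nodes that the master reports as offline.

    payload is the JSON text of a node listing (assumed shape:
    {"computer": [{"displayName": ..., "offline": ...}, ...]}).
    name_prefix optionally restricts the result to matching node names.
    """
    data = json.loads(payload)
    names = []
    for node in data.get("computer", []):
        name = node.get("displayName", "")
        if node.get("offline") and name.startswith(name_prefix):
            names.append(name)
    return names
```

      Comparing this list with the set of agents whose own logs claim a successful handshake would surface the mismatch described here without someone having to look at the node page.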

      Restarting the Java process does the trick, but I hate doing this manually.
      It seems the Swarm jar gets stuck right after the log line "Remote identity confirmed".

      Again, out of 1000 times a month, this issue occurs maybe 2-4 times.
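
      Since a manual restart is the only known workaround, a watchdog could detect the stall from the agent log and trigger the restart itself. The following is a rough sketch under stated assumptions: the log uses the java.util.logging layout shown above (an English-locale timestamp line followed by a message line), "INFO: Connected" is the line that normally follows the handshake, and the five-minute grace period and all function names are hypothetical choices.

```python
from datetime import datetime, timedelta

HANDSHAKE_MARKER = "INFO: Remote identity confirmed"
CONNECTED_MARKER = "INFO: Connected"
TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"  # e.g. "Feb 28, 2018 7:50:15 PM"


def parse_timestamp(line):
    """Return the datetime that starts a log header line, or None."""
    parts = line.strip().split(" ")
    if len(parts) < 5:
        return None
    try:
        return datetime.strptime(" ".join(parts[:5]), TIMESTAMP_FORMAT)
    except ValueError:
        return None


def is_stalled(log_lines, now, grace=timedelta(minutes=5)):
    """True if the handshake was confirmed but no "Connected" line followed
    within the grace period. A handshake line with no preceding timestamp
    is ignored, because its age cannot be determined."""
    last_seen = None      # timestamp of the most recent header line
    handshake_at = None   # time of a handshake not yet followed by "Connected"
    for line in log_lines:
        stamp = parse_timestamp(line)
        if stamp is not None:
            last_seen = stamp
            continue
        text = line.strip()
        if text.startswith(CONNECTED_MARKER):
            handshake_at = None
        elif text.startswith(HANDSHAKE_MARKER):
            handshake_at = last_seen
    return handshake_at is not None and now - handshake_at > grace
```

      A wrapper script could call is_stalled periodically on the agent's log and restart the Java process when it returns True. That automates the workaround but does not explain why the agent hangs.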

          Activity

          grayaii Alex Gray created issue -
          grayaii Alex Gray made changes -
          Field Original Value New Value
          Description: one sentence appended at the end: "Again, out of 1000 times a month, this issue occurs maybe 2-4 times." The rest of the text is unchanged.
          oleg_nenashev Oleg Nenashev made changes -
          Component/s: swarm-plugin [ 15741 ] changed to remoting [ 15489 ]
          oleg_nenashev Oleg Nenashev made changes -
          Assignee: Oleg Nenashev [ oleg_nenashev ] changed to Jeff Thompson [ jthompson ]
          jthompson Jeff Thompson made changes -
          Status: Open [ 1 ] changed to Closed [ 6 ]
          Resolution: set to Cannot Reproduce [ 5 ]

            People

            • Assignee: jthompson Jeff Thompson
            • Reporter: grayaii Alex Gray
            • Votes: 0
            • Watchers: 4

              Dates

              • Created:
                Updated:
                Resolved: