Bug
Resolution: Cannot Reproduce
Major
None
Jenkins ver. 2.89.4
Swarm 3.9
We spin up thousands of nodes with Swarm every month.
Every month we hit a few cases where the Swarm agent appears to connect successfully, but the Jenkins master does not show it.
The node has these logs (notice that the usual "INFO: Connected" line never appears):
INFO: Client.main invoked with: [-name eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be -description Created by Swarm. InstanceID=i-03918a0ef1ef6d8be AmiId=ami-a030b2d8 -executors 1 -fsroot /mnt/ope/ws -labels eod-us-west-2_spot_m3.xlarge -master https://jenkins.clearcare.it/ -mode normal -retry 30 -username sre@clearcareonline.com -password nJ0yuLYBcOJE -disableSslVerification]
Feb 28, 2018 7:49:57 PM hudson.plugins.swarm.Client run
INFO: Discovering Jenkins master
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.Client run
INFO: Attempting to connect to https://jenkins.clearcare.it/ ea7ab441-78d0-4548-a571-5feaae0be121 with ID fd8127ce
Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
SEVERE: Could not obtain CSRF crumb. Response code: 404
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among https://jenkins.foo.it/
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins.foo.it
Agent port: 30001
Identity: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.foo.it:30001
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec
In the master logs, I see this:
WARNING: Making eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce offline because it’s not responding
Restarting the Java process fixes it, but I don't want to keep doing this manually.
The Swarm client seems to hang after logging "Remote identity confirmed".
Again, out of roughly 1000 connections a month, this issue occurs maybe 2-4 times.
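
Until the root cause is found, the symptom described above could be detected automatically instead of by hand. The following is only a rough sketch in Python, not part of Swarm or Jenkins: it assumes the agent's output is captured to a text file, that the timestamps look like the ones in the log above, and that a handshake which has not reached "INFO: Connected" within a grace period counts as stuck. The function name `is_agent_stuck` and the 5-minute default are hypothetical choices.

```python
import re
from datetime import datetime, timedelta

# Matches timestamps such as "Feb 28, 2018 7:50:15 PM" at the start of a line.
TIMESTAMP = re.compile(
    r"^([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
)
TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"


def is_agent_stuck(log_text, now, grace=timedelta(minutes=5)):
    """Return True if the last handshake reached "Remote identity confirmed"
    but no "INFO: Connected" line followed within the grace period.

    The grace period is an assumption; tune it to how long a healthy
    agent normally takes to finish connecting.
    """
    last_stamp = None     # most recent timestamp seen so far
    confirmed_at = None   # timestamp of an unanswered "identity confirmed"
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            last_stamp = datetime.strptime(match.group(1), TIMESTAMP_FORMAT)
        if line.startswith("INFO: Connected"):
            # The handshake completed, so clear any pending marker.
            confirmed_at = None
        elif line.startswith("INFO: Remote identity confirmed"):
            # If no timestamp preceded this line, confirmed_at stays None
            # and the agent is conservatively treated as not stuck.
            confirmed_at = last_stamp
    return confirmed_at is not None and now - confirmed_at > grace
```

A cron job or supervisor script on the node could call this periodically with `datetime.now()` and restart the Swarm client process when it returns True, which automates the manual restart described above.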