Hi Jenkins team/community
we started to observe that our cloud agents sometimes (roughly fail rate is 10% 0.5% of launches) fail to come up online within first attempt. Here it a timeline and logs of such case:
~13:40 agent was created by cloud plugin (VMware in this case, but same problem was observed on kubernetes cloud also, so i suggest cloud type is not important)
Here is agent log
Jun 14, 2019 1:40:29 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using c:\remoting as a remoting work directory Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main createEngine INFO: Setting up agent: dev-take-onescm-build__1 Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener <init> INFO: Jenkins agent is running in headless mode. Jun 14, 2019 1:40:30 PM hudson.remoting.Engine startEngine INFO: Using Remoting version: 3.29 Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using c:\remoting as a remoting work directory Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Locating server among [http://jenkins.aureacentral.com/] Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Agent discovery successful Agent address: jenkins.aureacentral.com Agent port: 50000 Identity: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46 Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Handshaking Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connecting to jenkins.aureacentral.com:50000 Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Trying protocol: JNLP4-connect Jun 14, 2019 1:40:31 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46 Jun 14, 2019 1:40:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected
here is part of server log related to same agent
Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog INFO: [dev-take-onescm-build__1] Slave has come online. Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog INFO: [dev-take-onescm-build__1] Slave online Jun 14, 2019 1:40:37 PM org.jenkinsci.plugins.vSphereCloud$VSpherePlannedNode$1 call INFO: Provisioned new slave dev-take-onescm-build__1 Jun 14, 2019 1:41:17 PM hudson.slaves.NodeProvisioner$2 run INFO: dev-take-onescm-build__1 provisioning successfully completed. We have now 674 computer(s)
but agent did not show up online at 13:40 as you can see on load statistic screenshot (0 executors available)
I waited 5 minutes and restarted jenkins windows service on agent VM at 13:46,
here is agent log
Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using c:\remoting as a remoting work directory Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main createEngine INFO: Setting up agent: dev-take-onescm-build__1 Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener <init> INFO: Jenkins agent is running in headless mode. Jun 14, 2019 1:46:32 PM hudson.remoting.Engine startEngine INFO: Using Remoting version: 3.29 Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using c:\remoting as a remoting work directory Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Locating server among [http://jenkins.aureacentral.com/] Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Agent discovery successful Agent address: jenkins.aureacentral.com Agent port: 50000 Identity: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46 Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Handshaking Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connecting to jenkins.aureacentral.com:50000 Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Trying protocol: JNLP4-connect Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46 Jun 14, 2019 1:46:33 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected Jun 14, 2019 1:46:36 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
Server log
WARNING: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/10.25.33.149:50000 remote=10.25.78.118/10.25.78.118:49686]] / Computer.threadPoolForRemoting [#147073] for dev-take-onescm-build__1 terminated java.nio.channels.ClosedChannelException at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154) at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179) at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog INFO: [dev-take-onescm-build__1] Running disconnect procedure... Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog INFO: [dev-take-onescm-build__1] Disconnect done. Performing idle action NOTHING... Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog INFO: [dev-take-onescm-build__1] Idle action NOTHING complete.
and after that agent connected and come online and started job execution , (see load statistic screenshot above)
We upgraded to 2.164.3 from 2.164.1 last week and noticed this issue but we are not 100% sure it wasn't there before upgrade.
Any Ideas or direction to troubleshoot further are appreciated
Thanks