Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-58020

agent connected via TCP but executor does not show up online sometimes

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • core, remoting
    • None
    • Jenkins 2.176.1 , Remoting 3.29

      Hi Jenkins team/community

      we started to observe that our cloud agents sometimes (roughly fail rate is 10% 0.5% of launches) fail to come up online  within first attempt. Here it a timeline  and logs of such case:

      ~13:40 agent was created by cloud plugin (VMware in this case, but same problem was observed on kubernetes cloud also, so i suggest cloud type is not important)

      Here is agent log

      Jun 14, 2019 1:40:29 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using c:\remoting as a remoting work directory
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: dev-take-onescm-build__1
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jun 14, 2019 1:40:30 PM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 3.29
      Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using c:\remoting as a remoting work directory
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.aureacentral.com/]
      Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
        Agent address: jenkins.aureacentral.com
        Agent port:    50000
        Identity:      a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.aureacentral.com:50000
      Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Jun 14, 2019 1:40:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
      Jun 14, 2019 1:40:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected

      here is part of server log related to same agent

      Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog
      INFO: [dev-take-onescm-build__1] Slave has come online.
      Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog
      INFO: [dev-take-onescm-build__1] Slave online
      Jun 14, 2019 1:40:37 PM org.jenkinsci.plugins.vSphereCloud$VSpherePlannedNode$1 call
      INFO: Provisioned new slave dev-take-onescm-build__1
      Jun 14, 2019 1:41:17 PM hudson.slaves.NodeProvisioner$2 run
      INFO: dev-take-onescm-build__1 provisioning successfully completed. We have now 674 computer(s)
      

      but agent did not show up online  at 13:40 as you can see on load statistic screenshot (0 executors available) 

      I waited 5 minutes and restarted jenkins windows service on agent VM at 13:46,

      here is agent log

      Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using c:\remoting as a remoting work directory
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: dev-take-onescm-build__1
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jun 14, 2019 1:46:32 PM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 3.29
      Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using c:\remoting as a remoting work directory
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.aureacentral.com/]
      Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
        Agent address: jenkins.aureacentral.com
        Agent port:    50000
        Identity:      a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.aureacentral.com:50000
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
      Jun 14, 2019 1:46:33 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      Jun 14, 2019 1:46:36 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
      WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
      

      Server log

      WARNING: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/10.25.33.149:50000 remote=10.25.78.118/10.25.78.118:49686]] / Computer.threadPoolForRemoting [#147073] for dev-take-onescm-build__1 terminated
      java.nio.channels.ClosedChannelException
              at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
              at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
              at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
              at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
              at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
      INFO: [dev-take-onescm-build__1] Running disconnect procedure...
      Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
      INFO: [dev-take-onescm-build__1] Disconnect done.  Performing idle action NOTHING...
      Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
      INFO: [dev-take-onescm-build__1] Idle action NOTHING complete.
      

      and after that agent connected and come online and started  job execution , (see load statistic screenshot above)

      We upgraded to 2.164.3 from 2.164.1 last week and noticed this issue but we are not 100% sure it wasn't there before upgrade.

      Any Ideas or direction to troubleshoot further are appreciated

      Thanks

       

       

            Unassigned Unassigned
            iceiceice Alexey Grigorov
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated: