  1. Jenkins
  2. JENKINS-45290

Windows slaves stop being able to connect to master

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: core, remoting
    • Environment:

      Description

      After a while (the interval seems random, anywhere from a few days to about 4 weeks after Jenkins is started), Windows slaves stop being able to connect to the Jenkins master.
      After a restart of the Jenkins service, all slaves can connect without issues. When this issue occurs, all existing connections continue to work.
      There are no issues marking nodes as temporarily offline. But if I reboot a Windows slave, it can't connect at all until the Jenkins master is restarted.
      We tried connecting the slave to another Jenkins master (a staging server built by copying all data from our production Jenkins, so it is an exact clone), and the slave has no problem connecting
      to the staging Jenkins master. We have seen this issue on that server as well.
      The only way for us to resolve this is to restart the Jenkins service on the Jenkins master. As it starts, all Jenkins slaves connect automatically and work again.
      This isn't a good solution, as we have to abort all jobs in order to restart the service. For a production server, that leads to many (very) unhappy users.

      jenkins-slave.err.log: from launching Java Web Start until the problem occurs. It just hangs there, doing nothing.
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: jenkins_slave
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://jenkins_master.company.domain/, http://jenkins_master.company.domain:8080/
      jul 04, 2017 1:46:25 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins_master.company.domain
      Agent port: 49187
      Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins_master.company.domain:49187
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect

       

      Output from "java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string":
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: jenkins_slave
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://jenkins_master.company.domain/
      jul 04, 2017 2:10:18 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins_master.company.domain
      Agent port: 49187
      Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins_master.company.domain:49187
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
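Both logs above end at the same point: the last line is "Trying protocol: JNLP4-connect" and nothing follows. A minimal sketch of a check that flags this pattern in a saved agent log is shown below. The function name `stalled_protocol` is hypothetical, and the check only looks at the final non-empty line, so it cannot tell a genuine hang from a log that was captured a moment too early.

```python
from typing import Optional


def stalled_protocol(log_text: str) -> Optional[str]:
    """Return the protocol name if the log ends on a 'Trying protocol' line.

    Returns None when the log is empty or when the last non-empty line is
    anything else (for example a later status message or an exception).
    """
    prefix = "INFO: Trying protocol: "
    # Ignore blank lines and the indentation used in the issue text.
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1]
    return last[len(prefix):] if last.startswith(prefix) else None
```

Run against the contents of jenkins-slave.err.log, a non-None result means the agent got as far as choosing a protocol and then logged nothing more, which is the state described in this report.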

      tcpdump from master side over the course of 5 minutes:
      root@jenkins_master:~# date; tcpdump -nnvvv src slave_IP and dst port 49187 -s0 -vv -X -c 1000; date
      Tue Jul 4 13:51:00 CEST 2017
      tcpdump: listening on ens5f0, link-type EN10MB (Ethernet), capture size 262144 bytes
      13:51:26.751986 IP (tos 0x0, ttl 128, id 24431, offset 0, flags [DF], proto TCP (6), length 52)
      slave_IP.58214 > master_IP.49187: Flags [S], cksum 0xec23 (correct), seq 4123494006, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
      0x0000: 4500 0034 5f6f 4000 8006 5831 0a21 14a6 E..4_o@...X1.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8676 0000 0000 .!.<.f.#...v....
      0x0020: 8002 2000 ec23 0000 0204 05b4 0103 0308 .....#..........
      0x0030: 0101 0402 ....
      13:51:26.752189 IP (tos 0x0, ttl 128, id 24432, offset 0, flags [DF], proto TCP (6), length 40)
      slave_IP.58214 > master_IP.49187: Flags [.], cksum 0xa8dd (correct), seq 4123494007, ack 2362185277, win 256, length 0
      0x0000: 4500 0028 5f70 4000 8006 583c 0a21 14a6 E..(_p@...X<.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
      0x0020: 5010 0100 a8dd 0000 0000 0000 0000 P.............
      13:51:26.803856 IP (tos 0x0, ttl 128, id 24433, offset 0, flags [DF], proto TCP (6), length 64)
      slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0xc27a (correct), seq 0:24, ack 1, win 256, length 24
      0x0000: 4500 0040 5f71 4000 8006 5823 0a21 14a6 E..@_q@...X#.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
      0x0020: 5018 0100 c27a 0000 0016 5072 6f74 6f63 P....z....Protoc
      0x0030: 6f6c 3a4a 4e4c 5034 2d63 6f6e 6e65 6374 ol:JNLP4-connect
      13:51:26.804301 IP (tos 0x0, ttl 128, id 24434, offset 0, flags [DF], proto TCP (6), length 45)
      slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0x1c72 (correct), seq 24:29, ack 1, win 256, length 5
      0x0000: 4500 002d 5f72 4000 8006 5835 0a21 14a6 E..-_r@...X5.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 868f 8ccc 163d .!.<.f.#.......=
      0x0020: 5018 0100 1c72 0000 0003 4143 4b00 P....r....ACK.
      4 packets captured
      Tue Jul 4 13:55:57 CEST 2017
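To make the hex dump above easier to read, here is a small sketch that extracts the TCP payload from one packet as printed by `tcpdump -X` and decodes it as a length-prefixed string (two-byte big-endian length followed by the text). That the agent writes its frames in this shape is inferred from the captured bytes only; the helper names `tcp_payload` and `read_prefixed_string` are hypothetical.

```python
import struct


def tcp_payload(hex_words: str) -> bytes:
    """Return the TCP payload of one IPv4 packet given as tcpdump -X hex words."""
    raw = bytes.fromhex("".join(hex_words.split()))
    ip_header_len = (raw[0] & 0x0F) * 4                     # IHL field, in bytes
    (total_len,) = struct.unpack("!H", raw[2:4])            # IP total length
    tcp_header_len = (raw[ip_header_len + 12] >> 4) * 4     # TCP data offset, in bytes
    # Cut at total_len so Ethernet padding after the IP packet is dropped.
    return raw[ip_header_len + tcp_header_len:total_len]


def read_prefixed_string(payload: bytes) -> str:
    """Decode a two-byte big-endian length followed by that many bytes of text."""
    (length,) = struct.unpack("!H", payload[:2])
    return payload[2:2 + length].decode("utf-8")


# Hex words copied from the third and fourth packets in the capture above.
THIRD_PACKET = """
4500 0040 5f71 4000 8006 5823 0a21 14a6
0a21 1a3c e366 c023 f5c7 8677 8ccc 163d
5018 0100 c27a 0000 0016 5072 6f74 6f63
6f6c 3a4a 4e4c 5034 2d63 6f6e 6e65 6374
"""
FOURTH_PACKET = """
4500 002d 5f72 4000 8006 5835 0a21 14a6
0a21 1a3c e366 c023 f5c7 868f 8ccc 163d
5018 0100 1c72 0000 0003 4143 4b00
"""

if __name__ == "__main__":
    print(read_prefixed_string(tcp_payload(THIRD_PACKET)))
    print(read_prefixed_string(tcp_payload(FOURTH_PACKET)))
```

Note that the capture filter (`src slave_IP`) only records packets sent by the slave, so anything the master sent back would not appear in this dump.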

      So it seems that the slave can try to connect, but the master refuses to respond.
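One way to test this from any machine, without starting a full agent, is to replay the two frames seen in the capture and wait for a reply. The sketch below does that with a plain socket. It is a diagnostic sketch only: the frame contents are copied from the tcpdump output, the function names are hypothetical, and whether a healthy master answers these two frames on their own is an assumption that should be checked against a working master first.

```python
import socket
import struct


def prefixed_string(text: str) -> bytes:
    """Encode text as a two-byte big-endian length followed by the bytes."""
    data = text.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def probe(host: str, port: int, timeout: float = 10.0) -> bytes:
    """Send the two captured frames and return the first bytes the master sends.

    Returns b"" if the master stays silent until the timeout or closes the
    connection without sending anything.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(prefixed_string("Protocol:JNLP4-connect"))
        sock.sendall(prefixed_string("ACK"))
        try:
            return sock.recv(4096)
        except socket.timeout:
            return b""


if __name__ == "__main__":
    # Host and port are the placeholders used in this report.
    reply = probe("jenkins_master.company.domain", 49187)
    print("reply:", reply if reply else "none (timeout or closed)")
```

Comparing the result right after a master restart with the result once slaves can no longer connect would show whether the master still accepts the TCP connection but has stopped answering on the agent port.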

      We have tried searching Google for help, but without luck.

      We run the slave as a Windows service, but we also see the same issue via Java Web Start and with the command line suggested by Jenkins
      (java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string)

      Since we see the same issue when starting the slave as a Windows service, via Java Web Start, and from the command line, we're happy to support you using whichever way is easiest to troubleshoot.

       

      We're happy to provide additional information if need be.

        Attachments

          Activity

          ronen_tef Ronen Bar added a comment -

          Any progress/estimation for a fix?

          oleg_nenashev Oleg Nenashev added a comment -

          Ronen Bar an agent log would be useful just to have a full snapshot. If this is "java.io.IOException: An established connection was aborted by the software in your host machine" again, likely it's a whatever issue in your system.

          I cannot provide any ETA until the bug is at least triaged. But I would not expect a fix in October even if the issue is confirmed (unless it's a quick fix).

          ronen_tef Ronen Bar added a comment -

          Oleg Nenashev as already commented, I can't provide any agent log, as the agent is not even started and therefore no log is created.

          "...again, likely it's a whatever issue in your system...." In that case, how could it be that the same does not occur with ver. 2.46.1, but only after upgrading to ver. 2.78?

          The behavior should be consistent, I assume.

          oleg_nenashev Oleg Nenashev added a comment -

          Ronen Bar Currently I cannot confirm you have the same issue as Robin Ekerhag. If you are sure it's the same, you should be able to provide the agent connection stdout/stderr at least (from either slave.jar STDOUT or the wrapper logs).

           

          oleg_nenashev Oleg Nenashev added a comment -

          Unfortunately, I have no capacity to work on Remoting in the medium term, so I will unassign the issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


            People

            • Assignee:
              Unassigned
              Reporter:
              ekerhag Robin Ekerhag
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated: