Jenkins / JENKINS-24155

Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core, remoting
    • Labels:
    • Environment:
      Windows 7, Windows Server 2008

      Description

      I am running Jenkins 1.570.

      Occasionally, out of the blue, a large chunk of my Jenkins slaves will go offline and, most importantly, stay offline until Jenkins is rebooted. All of the slaves that go offline this way report the following reason:

      The current peer is reconnecting.

      If I look in my Jenkins logs, I see this for some of my slaves that remain online:

      Aug 07, 2014 11:13:07 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted connection #2018 from /172.16.100.79:51299
      Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
      TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: dev-build-03 is already connected to this master. Rejecting this connection.
      Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
      TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: Unrecognized name: dev-build-03

      The logs are flooded with all of that, with another one coming in every second.

      Lastly, there is one slave still shown as online that should be offline: that slave is fully shut down, yet Jenkins sees it as fully online. All of the offline slaves are running Jenkins' slave.jar file in headless mode, so I can see the console output. All of them believe that on their end they are "Online", but Jenkins itself lists them all as offline.
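      One way to clear such a stale channel without restarting the whole master is the Jenkins CLI's disconnect-node command. This is only a sketch: the master URL below is a placeholder, and the node name is taken from the log excerpt above.

      ```shell
      # Force the master to tear down the stale agent channel so the node
      # can reconnect, instead of restarting Jenkins entirely.
      # http://jenkins.example.com:8080/ is an illustrative master URL.
      java -jar jenkins-cli.jar -s http://jenkins.example.com:8080/ \
        disconnect-node dev-build-03 -m "Dropping stale channel (JENKINS-24155)"
      ```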

      This bug has been haunting me for quite a while now, and it is killing production for me. I really need to know if there's a fix for this, or at the very least, a version of Jenkins I can downgrade to that doesn't have this issue.

      Thank you!

        Attachments

        1. jenkins-slave.0.err.log
          427 kB
        2. log.txt
          220 kB
        3. masterJenkins.log
          370 kB

          Issue Links

            Activity

            nelu Nelu Vasilica added a comment -

            Just seen the same issue on Jenkins 1.642.1 with a Linux master. The fix was to restart Tomcat, and the Windows slaves reconnected automatically.
            Found several instances of "Ping started at xxxxxx hasn't completed by xxxxxxx" in the logs.
            Is setting the jenkins.slaves.NioChannelSelector.disabled property to true a viable workaround?
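            For reference, that property is a JVM system property set when the master is launched. A sketch, assuming the master is started directly from jenkins.war (under Tomcat, the -D flag would go into CATALINA_OPTS instead):

            ```shell
            # Disable the NIO-based JNLP agent connector so agent connections
            # fall back to the classic blocking-I/O handler.
            # This is a diagnostic workaround, not a fix.
            java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war
            ```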

            cesos Cesos Barbarino added a comment -

            Same issue here. Only this time, my tests never get done. The slaves are always dropping during the tests. Please help!

            oleg_nenashev Oleg Nenashev added a comment -

            I am not sure we can proceed much on this issue. Just to summarize changes related to several reports above...

            • Jenkins 2.50+ introduced runaway-process termination in the new Windows service termination logic. It should help with the "is already connected to this master" issues reported for Windows service agents. See JENKINS-39231.
            • Whatever happens in Jenkins after an "OutOfMemory" exception falls into "undefined behavior" territory. Jenkins should ideally switch to a disabled state afterwards, since the impact is not predictable.
            • JENKINS-25218 introduced fixes to the FifoBuffer handling logic; all fixes are available in 2.60.1.

            In order to proceed with this issue, I need somebody to confirm it still happens on 2.60.1 and to provide new diagnostics info.

            hechel Louis Heche added a comment -

            I'm having what seems to be this issue with Jenkins 2.138.3.

            Every 3-4 days, all the slave nodes go offline, although there seems to be no network problem. They come back online once the master has been restarted.

            Attached you'll find the logs: jenkins-slave.0.err.log and masterJenkins.log

            whitingjr Jeremy Whiting added a comment -

            Louis Heche, Oleg Nenashev, Cesos Barbarino:

            Can one of you do the following? To help narrow down the possible leak areas, it would be useful to capture process memory usage and JVM heap usage. Start your master process as normal, then start two tools on the system and redirect their output to separate files. Both tools have low system resource usage.

             Memory stats can be captured using pidstat, specifically the resident set size:

            $ pidstat -r -p <pid> 8 > /tmp/pidstat-capture.txt

             JVM heap size and GC behavior, specifically the percentage of heap space reclaimed after a full collection:

            $ jstat -gcutil -t -h12 <pid> 8s > /tmp/jstat-capture.txt

            Please attach the generated files to this issue.
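            Both captures can be run together in the background; a sketch, assuming the master's PID is discoverable with pgrep and that sysstat (pidstat) and a JDK (jstat) are installed:

            ```shell
            #!/bin/sh
            # Capture RSS and GC utilization for the master JVM side by side.
            # Adjust the pgrep pattern to however your master is launched.
            PID=$(pgrep -f jenkins.war)
            pidstat -r -p "$PID" 8 > /tmp/pidstat-capture.txt &        # RSS every 8 s
            jstat -gcutil -t -h12 "$PID" 8s > /tmp/jstat-capture.txt & # GC stats every 8 s
            wait
            ```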


              People

              • Assignee:
                Unassigned
              • Reporter:
                krandino Kevin Randino
              • Votes:
                30
              • Watchers:
                43
