  Jenkins / JENKINS-24201

All jnlp nodes go offline; require master reboot, sometimes followed by slave reboot.


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Component: remoting
    • Environment: Jenkins native install on Windows Server 2012, running on VMware vSphere.
      Jenkins ver. 1.565.1 (also seen on 1.563 and 1.570)

      The Jenkins master reports that all of its JNLP nodes (and all of our nodes are JNLP) are offline.

      The nodes themselves report that they are connected. The only way out is to restart the master. Of the roughly six times this has occurred, about half the time all the slaves also needed to have their slave process restarted to recover.

      We also see cases where, after restarting Jenkins, it recovers only for a short time before the problem recurs. However, if it is still running 10 minutes after a restart, we seem to be fine for 3-4 days.

      We were running version 1.565 when this first occurred, and had run fine for 3 months. What changed for us is that we increased the number of nodes: we now have 93, up from about 50. There was also an increase in the number of jobs.

      We use the vSphere Cloud Plugin. However, we changed one slave to use SSH instead of JNLP; the problem was resolved for this slave, and it does not get disconnected when the problem occurs. We did not see the same improvement for a vSphere/JNLP slave where we removed the vSphere configuration (i.e. recreated the slave without vSphere).

      This seems to be similar to JENKINS-24155, JENKINS-24050, JENKINS-22714, JENKINS-22932 and JENKINS-23384.

      We have examined the VM logs, the network logs and the firewalls. There is no obvious issue.

      I've attached the err.log of one of the incidents. Though it is clear that there is a problem with the slave connections, there is no clear 'cause'.

      I've attached a thread graph of the problem (from a different occurrence).

      Normally, Jenkins runs at about 200 threads. After the problem occurs, the thread count grows linearly until reboot. In the graph, we see there was a problem on Friday night, as activity died off. After a Saturday service restart, we see the problem occur again 6 hours later, with corresponding thread growth.
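
      The attached graph came from the Monitoring plugin; a plugin-independent way to track the same figure is to sample the master JVM's ThreadMXBean over JMX. The sketch below is only illustrative: it assumes remote JMX has been enabled on the master, and the host name and port are placeholders.

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadMXBean;
      import javax.management.MBeanServerConnection;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      /**
       * Samples the live thread count of a remote JVM (here: a Jenkins master)
       * once a minute over JMX, which is enough to reproduce the kind of growth
       * curve shown in thread_growth.png without the Monitoring plugin.
       * Assumes the master was started with remote JMX enabled
       * (-Dcom.sun.management.jmxremote.port=...); host and port below are placeholders.
       */
      public class MasterThreadCountSampler {
          public static void main(String[] args) throws Exception {
              String url = "service:jmx:rmi:///jndi/rmi://jenkins-master:9010/jmxrmi"; // placeholder
              try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
                  MBeanServerConnection server = connector.getMBeanServerConnection();
                  ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                          server, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                  while (true) {
                      // Live and peak thread counts of the monitored (master) JVM.
                      System.out.printf("%tF %<tT live=%d peak=%d%n",
                              System.currentTimeMillis(),
                              threads.getThreadCount(),
                              threads.getPeakThreadCount());
                      Thread.sleep(60000L);
                  }
              }
          }
      }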

      We suspect that if only the Jenkins service is restarted, the time to the next occurrence is shorter than if we restart the Jenkins master host machine.

      Also, we configure some nodes to turn off when idle. Though we originally suspected this as a possible cause, we have not found anything to further corroborate this theory.

      The graph was obtained using the Monitoring (Java Melody) plugin (https://wiki.jenkins-ci.org/display/JENKINS/Monitoring); we have since disabled it, but the problem has recurred.

      I've attached a thread dump. Again, I cannot see anything amiss here myself, but this is not my area of expertise. The thread dump is not from the same incident as the attached log.
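
      If it is useful for triage: a quick way to see which pool is growing is to histogram the thread names in a plain-text (jstack-style) dump. The attachment here is an .htm page rather than jstack output, so the following is only a rough sketch, using a heuristic grouping rule (trailing digits collapsed).

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.Map;
      import java.util.TreeMap;

      /**
       * Counts the threads in a plain-text (jstack-style) thread dump, grouped by
       * name with digits collapsed, so a leaking pool ("pool-1-thread-1",
       * "pool-1-thread-2", ...) shows up as a single line with a large count.
       */
      public class ThreadDumpHistogram {
          public static void main(String[] args) throws IOException {
              Map<String, Integer> counts = new TreeMap<>();
              for (String line : Files.readAllLines(Paths.get(args[0]))) {
                  line = line.trim();
                  // Thread entries in a jstack dump start with the quoted thread name.
                  if (line.startsWith("\"") && line.indexOf('"', 1) > 0) {
                      String name = line.substring(1, line.indexOf('"', 1));
                      String key = name.replaceAll("\\d+", "#"); // collapse per-thread numbering
                      counts.merge(key, 1, Integer::sum);
                  }
              }
              counts.forEach((name, n) -> System.out.printf("%5d  %s%n", n, name));
          }
      }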

      I've attached the output from one of the JNLP slaves. There is a SEVERE error reported on the slave, though it seems to recover. This is not from the same time as the error log.

      In the error log, I believe that this is the first sign of the issue:

      Aug 07, 2014 7:30:20 PM org.jenkinsci.remoting.nio.NioChannelHub run
      WARNING: Communication problem
      java.io.IOException: An existing connection was forcibly closed by the remote host
      at sun.nio.ch.SocketDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(Unknown Source)

      Just prior to this, a remote slave successfully completes a job. I believe the ci-25b-linux messages just before this are not related, as that slave had been displaying problems in the time leading up to the crash.

        1. crash_aug_7_1930.log (45 kB)
        2. jnlp_output.log (14 kB)
        3. thread_growth.png (116 kB)
        4. Thread dump [Jenkins].htm (265 kB)

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: James Noonan (jnoonan33)
            Votes: 0
            Watchers: 3
