Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-22758

Jenkins >=1.560 breaks Jenkins slave handling / NIO JNLP related (using swarm plugin)

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Blocker
    • Resolution: Fixed
    • Component/s: core
    • Labels:
    • Environment:
      openjdk-6 on Debian/wheezy
    • Similar Issues:

      Description

      My Jenkins slaves are connected via the swarm plugin (tested with swarm-client-1.10 and swarm-client-1.15), as soon as I upgrade Jenkins master to 1.560 Jenkins jobs very soon die with (quoting Jenkins job output):

      09:17:44 FATAL: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedByInterruptException
      09:17:44 hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedByInterruptException
      09:17:44 at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      09:17:44 at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      09:17:44 at hudson.remoting.Request.call(Request.java:174)
      09:17:44 at hudson.remoting.Channel.call(Channel.java:738)
      09:17:44 at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
      09:17:44 at com.sun.proxy.$Proxy57.join(Unknown Source)
      09:17:44 at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:951)
      09:17:44 at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:137)
      09:17:44 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:97)
      09:17:44 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
      09:17:44 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      09:17:44 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:740)
      09:17:44 at hudson.model.Build$BuildExecution.build(Build.java:198)
      09:17:44 at hudson.model.Build$BuildExecution.doRun(Build.java:159)
      09:17:44 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:519)
      09:17:44 at hudson.model.Run.execute(Run.java:1703)
      09:17:44 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
      09:17:44 at hudson.model.ResourceController.execute(ResourceController.java:88)
      09:17:44 at hudson.model.Executor.run(Executor.java:231)
      09:17:44 Caused by: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedByInterruptException
      09:17:44 at hudson.remoting.Request.abort(Request.java:299)
      09:17:44 at hudson.remoting.Channel.terminate(Channel.java:801)
      09:17:44 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      09:17:44 Caused by: java.nio.channels.ClosedByInterruptException
      09:17:44 at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:201)
      09:17:44 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:281)
      09:17:44 at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:33)
      09:17:44 at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
      09:17:44 at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
      09:17:44 at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
      09:17:44 at java.io.InputStream.read(InputStream.java:101)
      09:17:44 at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
      09:17:44 at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:77)
      09:17:44 at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2291)
      09:17:44 at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2584)
      09:17:44 at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2594)
      09:17:44 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1317)
      09:17:44 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
      09:17:44 at hudson.remoting.Command.readFrom(Command.java:92)
      09:17:44 at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
      09:17:44 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      ... and the slaves get disconnected, quoting from jenkins.log:

      Apr 25, 2014 9:40:34 AM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
      WARNING: Channel reader thread: jenkins-slave6 for + jenkins-slave6 terminated
      java.nio.channels.ClosedChannelException
      at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:131)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:174)
      at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:33)
      at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
      at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
      at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:85)
      at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2308)
      at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2321)
      at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2816)
      at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1451)
      at java.io.ObjectInputStream.readTypeString(ObjectInputStream.java:1417)
      at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:680)
      at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:829)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1600)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1914)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1797)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
      at hudson.remoting.Command.readFrom(Command.java:92)
      at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      AFAICS this seems to be related to the following Git commit:

      commit d4c74bf35d4d04bf289b020efd29f827458e0e11
      Author: Kohsuke Kawaguchi <kk@kohsuke.org>
      Date: Fri Apr 4 17:39:44 2014 -0700

      Integrated NIO JNLP slaves handling

      ]

      Downgrading to Jenkins v1.559 again works around the issue.

        Attachments

          Issue Links

            Activity

            Hide
            cbos Cees Bos added a comment - - edited

            It looks like not only an issue with swarm plugin.

            Today we did an update to 1.560 (from 1.554).
            We noticed 2 things:

            • We have a regular restart of the JNLP slaves with a Groovy script executed on the slave itself as soon as the executor is free. Each slave has only 1 executed.
              There are long running jobs active on the JNLP saves. So normally it waits till the job is completed, then the next item gets executed.
              But now the restart task got executed right away, and broke the running job.
            • The machine got restarted and then tried to connect again to Jenkins, that fails.
              We have a number of jnlp slaves, but we don't get them connected anymore.
              From the perspective of the slave it shows 'connected', but Jenkins master shows it as still unconnected.
              We have a Jenkins master on Linux and mix of all kind of slaves.

            In Jenkins you can click on 'Log' for a node, then there is:
            JNLP agent connected from /10.1.29.228

            But the UI still shows it as offline.

            Show
            cbos Cees Bos added a comment - - edited It looks like not only an issue with swarm plugin. Today we did an update to 1.560 (from 1.554). We noticed 2 things: We have a regular restart of the JNLP slaves with a Groovy script executed on the slave itself as soon as the executor is free. Each slave has only 1 executed. There are long running jobs active on the JNLP saves. So normally it waits till the job is completed, then the next item gets executed. But now the restart task got executed right away, and broke the running job. The machine got restarted and then tried to connect again to Jenkins, that fails. We have a number of jnlp slaves, but we don't get them connected anymore. From the perspective of the slave it shows 'connected', but Jenkins master shows it as still unconnected. We have a Jenkins master on Linux and mix of all kind of slaves. In Jenkins you can click on 'Log' for a node, then there is: JNLP agent connected from /10.1.29.228 But the UI still shows it as offline.
            Hide
            cbos Cees Bos added a comment -

            Reverting to 1.559 solved the issue for us.

            Version 1.560 also gave memory issues for us. Java heap kept growing.

            After the upgrade it grew up to the max value we have set. Jenkins became slower and slower.

            After downgrade to 1.559 memory usage is now below 1Gb heap.

            Show
            cbos Cees Bos added a comment - Reverting to 1.559 solved the issue for us. Version 1.560 also gave memory issues for us. Java heap kept growing. After the upgrade it grew up to the max value we have set. Jenkins became slower and slower. After downgrade to 1.559 memory usage is now below 1Gb heap.
            Hide
            klou Kurt added a comment - - edited

            We're also seeing the slave connection breakdown problem since 1.560 and we're not using swarm plugin.
            The connection breaks as soons as there is significant load on the slaves (all Slave executors occupied and queue > 0).

            We also have a snapshot build

            Jenkins ver. 1.560-SNAPSHOT (private-04/10/2014 15:41 GMT-jenkins)

            in use to verify the fix of and now to still overcome issue 22560. That version seems not to show the problem.
            From timestamp that build should be somwhere around build 3292 or 3293. Maybe that helps to faster isolate the critical change.

            Additionally we see the problem as soon as "slave.jar" is updated to v 2.39 (the one that comes with official Jenkins 1.560) while keeping the master unchanged (i.e. at above version).

            Our Environment:
            Master: Windows Server 2008R2, 64bit (jdk 7u51) running in a VMWare Machine
            Slaves (>10): Win7, 64bit (jre 7u51) running on physical hardware

            Show
            klou Kurt added a comment - - edited We're also seeing the slave connection breakdown problem since 1.560 and we're not using swarm plugin. The connection breaks as soons as there is significant load on the slaves (all Slave executors occupied and queue > 0). We also have a snapshot build Jenkins ver. 1.560-SNAPSHOT (private-04/10/2014 15:41 GMT-jenkins) in use to verify the fix of and now to still overcome issue 22560. That version seems not to show the problem. From timestamp that build should be somwhere around build 3292 or 3293. Maybe that helps to faster isolate the critical change. Additionally we see the problem as soon as "slave.jar" is updated to v 2.39 (the one that comes with official Jenkins 1.560) while keeping the master unchanged (i.e. at above version). Our Environment: Master: Windows Server 2008R2, 64bit (jdk 7u51) running in a VMWare Machine Slaves (>10): Win7, 64bit (jre 7u51) running on physical hardware
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            The issue is in the remoting library (actually, slave.jar is just a local copy of this lib). I suppose the issue is a blocker

            Show
            oleg_nenashev Oleg Nenashev added a comment - The issue is in the remoting library (actually, slave.jar is just a local copy of this lib). I suppose the issue is a blocker
            Hide
            klou Kurt added a comment -

            For us this problem seems to have improved a lot with Jenkins 1.563 and corresponding slave.jar 2.41 on the slaves.
            It was just a try that we did as JENKINS-22734 seemed to be related and fixed in that version. Somehow we now do not see any more broken slave connections even on high load (on either master or slave).

            What we still need is to disable the Response Time and Clock Difference preventive monitoring on the nodes->configure page.
            This is because we've noticed (even with 1.563) that right after launching Jenkins Master the Response Time (of all slaves) is recorded around 10000 ms (10 secs). Within the first few minutes after starting the Master the Response Time then slowly decreases to acceptable values of << 500 ms. But if preventive monitoring is on, the slaves are taken down due to the huge values at the beginning and before the Response Time decreases. Our LAN infrastructure is 1 GE connections on the slaves and a 10 GE on the Master (virtualized in a VMWare VM) with all machines directly connected to enterprise-grade backbone core switches of our data center. OS ping times are far below 1 ms.

            Show
            klou Kurt added a comment - For us this problem seems to have improved a lot with Jenkins 1.563 and corresponding slave.jar 2.41 on the slaves. It was just a try that we did as JENKINS-22734 seemed to be related and fixed in that version. Somehow we now do not see any more broken slave connections even on high load (on either master or slave). What we still need is to disable the Response Time and Clock Difference preventive monitoring on the nodes->configure page. This is because we've noticed (even with 1.563) that right after launching Jenkins Master the Response Time (of all slaves) is recorded around 10000 ms (10 secs). Within the first few minutes after starting the Master the Response Time then slowly decreases to acceptable values of << 500 ms. But if preventive monitoring is on, the slaves are taken down due to the huge values at the beginning and before the Response Time decreases. Our LAN infrastructure is 1 GE connections on the slaves and a 10 GE on the Master (virtualized in a VMWare VM) with all machines directly connected to enterprise-grade backbone core switches of our data center. OS ping times are far below 1 ms.
            Hide
            kohsuke Kohsuke Kawaguchi added a comment -

            This is fixed in remoting 2.41.

            Show
            kohsuke Kohsuke Kawaguchi added a comment - This is fixed in remoting 2.41.
            Hide
            svvivek Vivekanand SV added a comment -

            @koshuke, I am using jenkins version 1.574 and I get this error after upgrade to 1.574 for windows slaves (windows 8 and windows 2008 server, I also have windows 7 and windows 2012 server but they are working fine after jenkins upgrade). How should I get the fix that you mentioned above ?

            Do I just need to download salve.jar from the node page of the current jenkins and use that ? If yes, I tried that and it didn't work.

            Any ideas ?

            Vivek.

            Show
            svvivek Vivekanand SV added a comment - @koshuke, I am using jenkins version 1.574 and I get this error after upgrade to 1.574 for windows slaves (windows 8 and windows 2008 server, I also have windows 7 and windows 2012 server but they are working fine after jenkins upgrade). How should I get the fix that you mentioned above ? Do I just need to download salve.jar from the node page of the current jenkins and use that ? If yes, I tried that and it didn't work. Any ideas ? Vivek.
            Hide
            svvivek Vivekanand SV added a comment -

            I updated to 1.575 and used the latest salve.jar, things are working fine till now. Will update if there are any issues.

            Vivek.

            Show
            svvivek Vivekanand SV added a comment - I updated to 1.575 and used the latest salve.jar, things are working fine till now. Will update if there are any issues. Vivek.

              People

              • Assignee:
                kohsuke Kohsuke Kawaguchi
                Reporter:
                mika Michael Prokop
              • Votes:
                3 Vote for this issue
                Watchers:
                14 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: