Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Component/s: ssh-agent-plugin
    • Labels:
    • Environment:
      Jenkins 2.46.2
      Slave.jar version: 3.7
      SSH Agent plugin 1.15

      Description

      Randomly getting an agent disconnect with this output:

      java.nio.channels.ClosedChannelException
      19:33:02 at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      19:33:02 at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
      19:33:02 at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
      19:33:02 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      19:33:02 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      19:33:02 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      19:33:02 at java.lang.Thread.run(Unknown Source)
      19:33:02 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from wr2czc42446kf.jdnet.deere.com/172.23.213.39:59664' is disconnected.
      19:33:02 at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
      19:33:02 at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
      19:33:02 at com.sun.proxy.$Proxy74.isAlive(Unknown Source)
      19:33:02 at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043)
      19:33:02 at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035)
      19:33:02 at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
      19:33:02 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
      19:33:02 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
      19:33:02 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      19:33:02 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
      19:33:02 at hudson.model.Build$BuildExecution.build(Build.java:206)
      19:33:02 at hudson.model.Build$BuildExecution.doRun(Build.java:163)
      19:33:02 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
      19:33:02 at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:136)
      19:33:02 at hudson.model.Run.execute(Run.java:1728)
      19:33:02 at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
      19:33:02 at hudson.model.ResourceController.execute(ResourceController.java:98)
      19:33:02 at hudson.model.Executor.run(Executor.java:405)


          Activity

          tomahawk1187 Anargyros Tomaras added a comment -

          I have the exact same problem. Jenkins master on Linux, slaves run on Windows 1803 containers. Any ideas, please?

          marlowa Andrew Marlow added a comment -

          I have exactly the same problem. Jenkins version 2.109 on Windows 10 with Linux slaves running RHEL5 and RHEL7. It strikes every few days and seems to me to be related to transient network errors. I think Jenkins needs to be more robust in the face of transient network errors.

          I have seen similar random build problems due to Subversion flaking out over transient network errors. Subversion was recently changed to be more robust in this area and things are noticeably better. I think something similar is needed for Jenkins.

           

          jamesfairweather James Fairweather added a comment -

          I managed to resolve this locally. I looked at the server logs and saw an entry about an out-of-memory exception when the new agent was trying to connect. Further research into the problem indicated that Java was not able to allocate a new thread because I was using too much heap memory. My Jenkins startup parameters had this:

          -Xrs -Xmx1536m

          There was a comment in the jenkins.xml file indicating the parameters were set that way because they had seen a lot of out-of-memory exceptions. But those parameters were the cause of the problem.

          I replaced those two parameters with:

          -Xms256m -Xmx512m

          And now I have a master with 7 build agents connected and the system is stable.
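
          For anyone wondering where these flags live: on a Windows master installed as a service, they are typically set in the <arguments> element of jenkins.xml. The line below is only a sketch; the lifecycle switch, war path, and port are the stock defaults, not values taken from this installation:

          <!-- example only: everything besides the heap flags is a placeholder default -->
          <arguments>-Xms256m -Xmx512m -Dhudson.lifecycle=hudson.lifecycle.WindowsServiceLifecycle -jar "%BASE%\jenkins.war" --httpPort=8080</arguments>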

          Of course this specific solution may not solve other people's problems because in our case it was being caused by a previous maintainer not really understanding the implications of giving Java so much heap space.  But the advice about examining the Jenkins server logs stands for everyone.

          P.S. We would also get random disconnects from previously-connected agents, but that turned out to be a problem with the power management scheme on the agent. Be sure to disable all power-saving options so the agent is always running.
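
          On a Windows agent, one way to do that from an elevated command prompt is sketched below; the exact commands depend on your power plan, so treat these as an example rather than a prescription:

          rem example only: a timeout of 0 disables sleep/hibernate while on AC power
          powercfg /change standby-timeout-ac 0
          powercfg /change hibernate-timeout-ac 0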

           

          jthompson Jeff Thompson added a comment -

          Several of these issues involve similar reports but possibly very different causes. Frequently the error indicates that the channel is closed but provides no indication as to how or why that occurred. Commonly, remoting issues involve something in the networking or system environment terminating the connection from outside the process; the trick can be to determine what is doing that. In one instance (JENKINS-52922), Nush Ahmd discovered that setting hudson.slaves.ChannelPinger.pingIntervalSeconds kept the channel from getting disconnected. Or, as Fabian Sörensson noted in JENKINS-48895, fiddling with Windows sleep / hibernate options can help. Or it may come down to various timeouts.
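
          As a sketch of that first workaround: the ping interval is a system property on the master JVM, so it can be passed at startup. The value of 60 and the plain `java -jar` launch below are only illustrations; adjust to how your master is actually started:

          # example only: interval value and launch command are illustrative
          java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=60 -jar jenkins.war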

          One thing that can help is to increase agent or master logging output. You can read about it here: https://github.com/jenkinsci/remoting/blob/master/docs/logging.md . In summary, you add a java.util.logging properties file and then reference it via the `-loggingConfig` parameter to the agent, for example: `-loggingConfig jenkins-logging.properties`.
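
          For example, a minimal java.util.logging configuration along these lines could be saved as jenkins-logging.properties and passed to the agent with `-loggingConfig jenkins-logging.properties`; the log file name and levels here are assumptions, adjust to taste:

          # example only: file name and levels are placeholders
          handlers = java.util.logging.FileHandler
          .level = FINEST
          java.util.logging.FileHandler.pattern = remoting.%u.log
          java.util.logging.FileHandler.level = ALL
          java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter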

          Without further information it is difficult to diagnose anything from this side. Frequently the error is environmental.

          jthompson Jeff Thompson added a comment -

          Closing for lack of sufficient diagnostics and information to reproduce after no response for quite a while.


            People

            • Assignee:
              Unassigned
              Reporter:
              jdtester Jens Fiedelak
            • Votes:
              14
              Watchers:
              24
