Jenkins / JENKINS-30823

Exception in thread "Channel reader thread: <node-name>" java.lang.OutOfMemoryError: Java heap space

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: remoting
    • Environment:
      Jenkins 1.609.1 LTS. Attaching support plugin bundle with additional details.
      Description

      Two days ago we encountered an issue where all builds attempted on two of our nodes failed. Console output for each failed build was the following:

      ERROR: SEVERE ERROR occurs 
      org.jenkinsci.lib.envinject.EnvInjectException: hudson.remoting.ChannelClosedException: channel is already closed 
      at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:75)
      at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:81) 
      at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:39) 
      at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:574) 
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494) 
      at hudson.model.Run.execute(Run.java:1741) 
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) 
      at hudson.model.ResourceController.execute(ResourceController.java:98) 
      at hudson.model.Executor.run(Executor.java:374) 
      Caused by: hudson.remoting.ChannelClosedException: channel is already closed 
      at hudson.remoting.Channel.send(Channel.java:550) 
      at hudson.remoting.Request.call(Request.java:129) 
      at hudson.remoting.Channel.call(Channel.java:752) 
      at hudson.FilePath.act(FilePath.java:1073) 
      at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:44)
      ... 8 more 
      Caused by: java.io.IOException 
      at hudson.remoting.Channel.close(Channel.java:1109) 
      at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118) 
      at hudson.remoting.PingThread.ping(PingThread.java:126) 
      at hudson.remoting.PingThread.run(PingThread.java:85) 
      Caused by: java.util.concurrent.TimeoutException: Ping started at 1444039585493 hasn't completed by 1444039825494 
      ... 2 more
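      For reference, the two epoch-millisecond timestamps in the TimeoutException above are 240 001 ms apart, i.e. the channel was declared dead after a roughly four-minute ping window (consistent with remoting's default ping timeout). A trivial check:

```java
public class PingWindow {
    public static void main(String[] args) {
        // Epoch-millisecond timestamps copied from the TimeoutException above.
        long started  = 1444039585493L;
        long deadline = 1444039825494L;
        long windowMs = deadline - started;             // 240001 ms, i.e. ~4 minutes
        System.out.println(windowMs + " ms = " + (windowMs / 60000.0) + " min");
    }
}
```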
      

      These are Windows slaves (JNLP) which run the slave jar as a service. We attempted to restart the slave services first, then rebooted the slaves, and ultimately had to restart the Jenkins master to recover from the problem.

      Our after-action root cause analysis turned up the following in the Jenkins master's log file:

      Exception in thread "Channel reader thread: DC-JENWIN-001" java.lang.OutOfMemoryError: Java heap space
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
      Exception in thread "Channel reader thread: DC-JENWIN-003" java.lang.OutOfMemoryError: Java heap space
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
      

      These entries are remarkable in that they have no date-time stamps; they appear not to have been emitted by the normal logger. Nonetheless, by looking at the log entries immediately preceding and following them, we were able to localize the entries to within a minute's accuracy, and they are clearly the cause of the problem.

      Moreover, after these two entries occurred, entries similar to the following appeared in the master log at regular intervals:

      Oct 05, 2015 10:08:54 AM com.cloudbees.jenkins.support.AsyncResultCache run
      INFO: Could not retrieve metrics data from DC-JENWIN-001 for caching
      java.util.concurrent.TimeoutException
      at hudson.remoting.Request$1.get(Request.java:272)
      at hudson.remoting.Request$1.get(Request.java:206)
      at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      at com.cloudbees.jenkins.support.AsyncResultCache.run(AsyncResultCache.java:95)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      

      These errors appeared for both of the affected slave nodes, and only for them, when retrieving the "metrics", the "java version", the "environment", and all the various other statuses.
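      The Support Core behavior seen in the log above — try to refresh a value from the slave, give up after a timeout, and keep the last cached value — can be sketched as follows (the names here are illustrative, not the plugin's actual API):

```java
import java.util.concurrent.*;

/** Illustrative sketch of an async result cache: refresh a value over a
 *  (possibly dead) remote channel with a bounded wait, falling back to the
 *  cached value on timeout. Not the actual AsyncResultCache implementation. */
public class AsyncCache<T> {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "async-cache");
        t.setDaemon(true);                 // don't keep the JVM alive
        return t;
    });
    private volatile T cached;

    public AsyncCache(T initial) { this.cached = initial; }

    /** Try to refresh; on timeout, keep (and return) the stale cached value. */
    public T refresh(Callable<T> remoteCall, long timeoutMs) {
        Future<T> f = pool.submit(remoteCall);
        try {
            cached = f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {            // e.g. TimeoutException, as in the log
            f.cancel(true);                // here the plugin logs "Could not retrieve ..."
        }
        return cached;
    }
}
```

On a healthy channel the value refreshes; on a dead one the call times out every interval, which is why the "Could not retrieve ... for caching" entries repeat.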

      Bottom Line

      Connectivity to these slaves suffered a critical and apparently unrecoverable failure. The failure originated on the master side of the connection, apparently due to a heap space issue. Yet the master did not properly recognize the nodes as offline and continued to send builds to them. Given that jobs tend to be "sticky" to a certain slave as long as it is available, this effectively rendered several critical jobs unbuildable.

      Recommended Resolution

      Though exact reproduction of this problem may be impossible, it is hoped that some steps can be taken to enable Jenkins to correctly identify this problem and take corrective, or at least mitigating, action, such as re-initializing whatever thread/component was inoperable, or marking the node as offline.
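      The mitigating action suggested above can be sketched in plain Java (NodeWatchdog, probe, and the timeout are illustrative names, not Jenkins APIs): probe the channel with a bounded wait and, on timeout or error, mark the node offline so the scheduler stops sending builds to it.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

/** Hypothetical sketch of the suggested mitigation: run a health probe with a
 *  timeout and take the node offline when the probe hangs or fails, instead of
 *  continuing to route builds to a dead channel. */
public class NodeWatchdog {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "node-watchdog");
        t.setDaemon(true);                 // don't keep the JVM alive
        return t;
    });
    private volatile boolean online = true;

    public boolean isOnline() { return online; }

    /** Run the probe; if it fails or exceeds timeoutMs, mark the node offline. */
    public void check(Supplier<Boolean> probe, long timeoutMs) {
        Future<Boolean> f = pool.submit((Callable<Boolean>) probe::get);
        try {
            online = f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {            // TimeoutException, ExecutionException, ...
            f.cancel(true);
            online = false;                // mitigating action: stop sending builds here
        }
    }
}
```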

            Activity

            danielbeck Daniel Beck added a comment -

            Assigning to Support Core which seems to be responsible for the data transfer.

            oleg_nenashev Oleg Nenashev added a comment -

            I think there are two independent issues.

            • The first one is related to EnvInject + remoting. I've seen similar issues on other instances, so there is a problem there.
            • The second one is related to the Support Core plugin.

            Both issues are related to the remoting layer, but we have no evidence that they are caused by the same underlying issue.

            kbaltrinic Kenneth Baltrinic added a comment -

            If I am understanding this correctly, then the second issue is really just an annoyance in that it spams the logs, but the real issue causing the slave inoperability is the EnvInject issue. Correct?

            oleg_nenashev Oleg Nenashev added a comment -

            I would say that both issues are caused by the slave connection instability after the OOM issue.
            EnvInject behaves correctly; it's just the first call to the corrupted slave in the build.

            oleg_nenashev Oleg Nenashev added a comment -

            rsandell was going to work on it IIRC

            allan_burdajewicz Allan BURDAJEWICZ added a comment -

            Oleg Nenashev any update on this issue?


              People

              • Assignee:
                Unassigned
              • Reporter:
                kbaltrinic Kenneth Baltrinic
              • Votes:
                0
              • Watchers:
                5
