  Jenkins / JENKINS-30823

Exception in thread "Channel reader thread: <node-name>" java.lang.OutOfMemoryError: Java heap space


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: remoting
    • Environment: Jenkins 1.609.1 LTS. Attaching support plugin bundle with additional details.

      Two days ago we encountered an issue where all builds attempted on two of our nodes failed. Console output for each failed build was the following:

      ERROR: SEVERE ERROR occurs 
      org.jenkinsci.lib.envinject.EnvInjectException: hudson.remoting.ChannelClosedException: channel is already closed 
      at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:75)
      at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:81) 
      at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:39) 
      at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:574) 
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494) 
      at hudson.model.Run.execute(Run.java:1741) 
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) 
      at hudson.model.ResourceController.execute(ResourceController.java:98) 
      at hudson.model.Executor.run(Executor.java:374) 
      Caused by: hudson.remoting.ChannelClosedException: channel is already closed 
      at hudson.remoting.Channel.send(Channel.java:550) 
      at hudson.remoting.Request.call(Request.java:129) 
      at hudson.remoting.Channel.call(Channel.java:752) 
      at hudson.FilePath.act(FilePath.java:1073) 
      at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:44)
      ... 8 more 
      Caused by: java.io.IOException 
      at hudson.remoting.Channel.close(Channel.java:1109) 
      at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118) 
      at hudson.remoting.PingThread.ping(PingThread.java:126) 
      at hudson.remoting.PingThread.run(PingThread.java:85) 
      Caused by: java.util.concurrent.TimeoutException: Ping started at 1444039585493 hasn't completed by 1444039825494 
      ... 2 more
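
      The two timestamps in the TimeoutException are 240,001 ms apart, consistent with a four-minute ping timeout. For context, the following is a minimal sketch of the ping/watchdog pattern that hudson.remoting.PingThread and hudson.slaves.ChannelPinger implement. This is plain Java for illustration, not actual Jenkins code; the nested Channel interface here is a hypothetical stand-in for hudson.remoting.Channel:

      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      /** Sketch only; "Channel" is a stand-in, not hudson.remoting.Channel. */
      public class PingWatchdog {
          interface Channel {
              Future<Void> ping();   // hypothetical no-op round trip to the slave
              void close();          // what ChannelPinger's onDead callback triggers
          }

          private final Channel channel;
          private final long timeoutMillis;

          PingWatchdog(Channel channel, long timeoutMillis) {
              this.channel = channel;
              this.timeoutMillis = timeoutMillis;
          }

          void pingOnce() throws InterruptedException {
              long started = System.currentTimeMillis();
              try {
                  channel.ping().get(timeoutMillis, TimeUnit.MILLISECONDS);
              } catch (TimeoutException e) {
                  // Same shape of message as seen in the build log:
                  // "Ping started at <t0> hasn't completed by <t1>"
                  System.err.println("Ping started at " + started
                          + " hasn't completed by " + System.currentTimeMillis());
                  channel.close();   // this close() is what builds later see as
                                     // "channel is already closed"
              } catch (ExecutionException e) {
                  channel.close();
              }
          }
      }

      The point relevant to this report: the close happens as a side effect inside the ping thread, so, as far as we can tell, nothing in that path necessarily updates the master's notion of the node being offline.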
      

      These are Windows slaves (JNLP) that run the slave jar as a service. We attempted to restart the slave services first, then rebooted the slaves, and ultimately had to restart the Jenkins master to recover from the problem.

      Our after-action root cause analysis turned up the following in the Jenkins master's log file:

      Exception in thread "Channel reader thread: DC-JENWIN-001" java.lang.OutOfMemoryError: Java heap space
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
      Exception in thread "Channel reader thread: DC-JENWIN-003" java.lang.OutOfMemoryError: Java heap space
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
      

      These entries are remarkable in that they carry no date-time stamps; it seems they were not emitted by the normal logger. Nonetheless, from the log entries immediately preceding and following them we were able to localize them to within about a minute, and they are clearly the cause of the problem.
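
      The missing timestamps are consistent with how the JVM reports exceptions that escape a thread's run() method: the default uncaught-exception handler prints the "Exception in thread ..." line directly to stderr, bypassing java.util.logging entirely. A minimal demonstration (the thread name is copied from our log purely for illustration; a RuntimeException stands in for the OutOfMemoryError):

      // Shows why the OutOfMemoryError lines carry no timestamps: an exception
      // that escapes run() is printed by the JVM's default uncaught-exception
      // handler straight to stderr, not routed through any logger.
      public class UncaughtDemo {
          public static void main(String[] args) throws InterruptedException {
              Thread t = new Thread(
                      () -> { throw new RuntimeException("stand-in for the OOME"); },
                      "Channel reader thread: DC-JENWIN-001");
              t.start();
              t.join();
              // stderr now contains, with no timestamp:
              // Exception in thread "Channel reader thread: DC-JENWIN-001"
              //     java.lang.RuntimeException: stand-in for the OOME
          }
      }

      Installing a handler via Thread.setDefaultUncaughtExceptionHandler and routing it through the normal logger would at least make such failures timestamped and searchable.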

      Moreover, after these two entries occurred, entries similar to the following appeared in the master log at regular intervals:

      Oct 05, 2015 10:08:54 AM com.cloudbees.jenkins.support.AsyncResultCache run
      INFO: Could not retrieve metrics data from DC-JENWIN-001 for caching
      java.util.concurrent.TimeoutException
      at hudson.remoting.Request$1.get(Request.java:272)
      at hudson.remoting.Request$1.get(Request.java:206)
      at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      at com.cloudbees.jenkins.support.AsyncResultCache.run(AsyncResultCache.java:95)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      

      These errors appeared for both of the affected slave nodes, and only for those nodes, when retrieving the "metrics", the "java version", the "environment", and all the various other cached statuses.

      Bottom Line

      Connectivity to these slaves suffered a critical and apparently unrecoverable failure. The failure originated on the master side of the connection, apparently due to a heap space issue. Yet the master did not properly recognize the nodes as offline and continued to send builds to them. Given that jobs tend to be "sticky" to a particular slave as long as it is available, this effectively rendered several critical jobs unbuildable.

      Recommended Resolution

      Though exact reproduction of this problem may be impossible, it is hoped that steps can be taken to enable Jenkins to correctly identify this condition and take corrective, or at least mitigating, action, such as re-initializing whatever thread/component was inoperable or marking the node as offline.
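
      As one possible shape for such a mitigation, here is a hedged sketch of a periodic task that probes each apparently-online node with a trivial remote call and takes the node out of rotation if the probe cannot complete. The class and method names (PeriodicWork, Computer.setTemporarilyOffline, MasterToSlaveCallable, OfflineCause.ByCLI) are from Jenkins core of this era, but the probe timeout and recurrence period are arbitrary assumptions; treat this as an illustration of the requested behavior, not a tested fix:

      import hudson.Extension;
      import hudson.model.Computer;
      import hudson.model.PeriodicWork;
      import hudson.slaves.OfflineCause;
      import jenkins.model.Jenkins;
      import jenkins.security.MasterToSlaveCallable;

      import java.util.concurrent.TimeUnit;

      /** Sketch: probe each node's channel and mark unresponsive nodes offline. */
      @Extension
      public class DeadChannelDetector extends PeriodicWork {

          @Override
          public long getRecurrencePeriod() {
              return 5 * MIN;   // check every five minutes (arbitrary choice)
          }

          @Override
          protected void doRun() throws Exception {
              for (Computer c : Jenkins.getInstance().getComputers()) {
                  if (c.getChannel() == null || c.isTemporarilyOffline()) {
                      continue;   // already known-dead or already out of rotation
                  }
                  try {
                      // Trivial round trip; a healthy channel answers promptly.
                      c.getChannel().callAsync(new Probe()).get(30, TimeUnit.SECONDS);
                  } catch (Exception dead) {   // TimeoutException => stuck channel
                      c.setTemporarilyOffline(true,
                              new OfflineCause.ByCLI("unresponsive channel; see JENKINS-30823"));
                  }
              }
          }

          /** No-op payload executed on the slave side. */
          private static final class Probe
                  extends MasterToSlaveCallable<String, RuntimeException> {
              public String call() {
                  return "ok";
              }
          }
      }

      Marking the node temporarily offline would break the job "stickiness" described above and let builds migrate to healthy nodes until an administrator investigates.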

            Assignee: Unassigned
            Reporter: Kenneth Baltrinic (kbaltrinic)
            Votes: 0
            Watchers: 6