Jenkins / JENKINS-60960

Agent-side memory leak (ExportTable retaining some UNIXProcess$ProcessPipeInputStream)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core, durable-task-plugin
    • Labels:
      None
    • Environment:
      Jenkins 2.204.2
      durable-task 1.33
      Linux, OpenJDK 8

      Description

      I came across OOM errors on some agents which had been online for 2-3 weeks.

      Heap dumps showed thousands of java.lang.UNIXProcess$ProcessPipeInputStream objects retained by the remoting ExportTable. There were no zombie processes on the agent VM, and all of these ProcessPipeInputStream objects were linked to ProcessBuilder.NullInputStream.INSTANCE, which also suggests they belonged to processes which had already exited, at least from what I understand of this code:
      http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/tip/src/solaris/classes/java/lang/UNIXProcess.java#l525

      Because these input streams are also BufferedInputStream instances, each of them retains an 8 KB byte array, which adds up to a significant amount of memory when there are thousands of them.
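      To put that 8 KB per stream in perspective, here is a back-of-the-envelope estimate (the 10,000 count below is illustrative, not a figure from the dumps):

      ```java
      public class LeakEstimate {
          public static void main(String[] args) {
              // Hypothetical number of leaked ProcessPipeInputStream objects;
              // the heap dumps only said "thousands".
              int leaked = 10_000;
              // BufferedInputStream allocates an 8192-byte buffer by default.
              long retainedBytes = leaked * 8192L;
              System.out.println(retainedBytes + " bytes (" + retainedBytes / (1024 * 1024) + " MiB)");
          }
      }
      ```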

      In the dumps, the ExportTable$CreatedAt of these entries had an allocationTrace which was, unfortunately, not very helpful for identifying the root cause (though maybe I'm just looking at code I don't know):

      hudson.remoting.ExportTable$Entry.<init>() (ExportTable.java:114)
      hudson.remoting.ExportTable.export() (ExportTable.java:365)
      hudson.remoting.Channel.internalExport() (Channel.java:776)
      hudson.remoting.RemoteInputStream.writeObject() (RemoteInputStream.java:174)
      sun.reflect.NativeMethodAccessorImpl.invoke0() (NativeMethodAccessorImpl.java)
      sun.reflect.NativeMethodAccessorImpl.invoke() (NativeMethodAccessorImpl.java:62)
      sun.reflect.DelegatingMethodAccessorImpl.invoke() (DelegatingMethodAccessorImpl.java:43)
      java.lang.reflect.Method.invoke() (Method.java:498)
      java.io.ObjectStreamClass.invokeWriteObject() (ObjectStreamClass.java:1140)
      java.io.ObjectOutputStream.writeSerialData() (ObjectOutputStream.java:1496)
      java.io.ObjectOutputStream.writeOrdinaryObject() (ObjectOutputStream.java:1432)
      java.io.ObjectOutputStream.writeObject0() (ObjectOutputStream.java:1178)
      java.io.ObjectOutputStream.defaultWriteFields() (ObjectOutputStream.java:1548)
      java.io.ObjectOutputStream.writeSerialData() (ObjectOutputStream.java:1509)
      java.io.ObjectOutputStream.writeOrdinaryObject() (ObjectOutputStream.java:1432)
      java.io.ObjectOutputStream.writeObject0() (ObjectOutputStream.java:1178)
      java.io.ObjectOutputStream.writeObject() (ObjectOutputStream.java:348)
      hudson.remoting.UserRequest._serialize() (UserRequest.java:263)
      hudson.remoting.UserRequest.serialize() (UserRequest.java:272)
      hudson.remoting.UserRequest.perform() (UserRequest.java:222)
      hudson.remoting.UserRequest.perform() (UserRequest.java:54)
      hudson.remoting.Request$2.run() (Request.java:369)
      hudson.remoting.InterceptingExecutorService$1.call() (InterceptingExecutorService.java:72)
      java.util.concurrent.FutureTask.run() (FutureTask.java:266)
      java.util.concurrent.ThreadPoolExecutor.runWorker() (ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run() (ThreadPoolExecutor.java:624)
      java.lang.Thread.run() (Thread.java:748)
      

      So, I've compared heap dumps taken over short periods of time on an agent, correlating the delta with build activity. After trimming down a particular Pipeline which was contributing to the leak, I came to the conclusion that this is a reliable way to leak two java.lang.UNIXProcess$ProcessPipeInputStream objects:

      node {
          stage("test") {
              sh '/bin/true'
          }
      }
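      Remoting aside, the underlying retention pattern can be sketched at the plain-JDK level: as long as something (here, a list; in Jenkins, the ExportTable) holds a reference to an exited process's stdout, the BufferedInputStream and its 8 KB buffer stay reachable. A minimal sketch, assuming a Unix-like host with /bin/true:

      ```java
      import java.io.InputStream;
      import java.util.ArrayList;
      import java.util.List;

      public class RetainedStreams {
          public static void main(String[] args) throws Exception {
              List<InputStream> retained = new ArrayList<>();
              for (int i = 0; i < 100; i++) {
                  Process p = new ProcessBuilder("/bin/true").start();
                  p.waitFor();
                  // The process has exited, but the stream object (and its buffer)
                  // stays reachable for as long as we keep a reference to it.
                  retained.add(p.getInputStream());
              }
              System.out.println(retained.size() + " streams retained after process exit");
          }
      }
      ```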
      

      I've also tried random incantations, like forcing GC on master and agent, but it looks like once these objects have been exported they really won't get unexported. And of course, running a freestyle job with a similar Shell step does not trigger a similar issue (the agent's ExportTable is exactly the same before/after running the job).

      Because it was looking like a durable-task bug at first glance, I've tried to reproduce it in a unit test. So far, I've been plugging this into BourneShellScriptTest.smokeTest:

      diff --git a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java b/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      index 85a8ca9..b16e82a 100644
      --- a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      +++ b/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      @@ -37,15 +37,20 @@ import hudson.Platform;
       import hudson.Proc;
       import hudson.model.Slave;
       import hudson.plugins.sshslaves.SSHLauncher;
      +import hudson.remoting.Channel;
       import hudson.remoting.VirtualChannel;
       import hudson.slaves.DumbSlave;
       import hudson.tasks.Shell;
       import hudson.util.StreamTaskListener;
       import hudson.util.VersionNumber;
      +import jenkins.security.MasterToSlaveCallable;
      +
       import java.io.ByteArrayOutputStream;
       import java.io.File;
       import java.io.IOException;
       import java.io.InputStream;
      +import java.io.PrintWriter;
      +import java.io.StringWriter;
       import java.nio.charset.Charset;
       import java.nio.charset.StandardCharsets;
       import java.time.Instant;
      @@ -217,6 +222,7 @@ public class BourneShellScriptTest {
                       throw new AssertionError(platform);
               }
       
      +        assertNoProcessPipeInputStreamInRemoteExportTable();
               String script = String.format("echo hello world; sleep %s", sleepSeconds);
               Controller c = new BourneShellScript(script).launch(new EnvVars(), ws, launcher, listener);
               awaitCompletion(c);
      @@ -225,7 +231,9 @@ public class BourneShellScriptTest {
               assertEquals(0, c.exitStatus(ws, launcher, listener).intValue());
               assertTrue(baos.toString().contains("hello world"));
               c.cleanup(ws);
      -        assertNoZombies();
      +        //assertNoZombies();
      +        Thread.sleep(5_000); // give it some time to unexport other stuff
      +        assertNoProcessPipeInputStreamInRemoteExportTable();
           }
       
           @Test public void stop() throws Exception {
      @@ -595,4 +603,40 @@ public class BourneShellScriptTest {
                   Thread.sleep(100);
               }
           }
      +
      +    private void assertNoProcessPipeInputStreamInRemoteExportTable()
      +            throws IOException, InterruptedException {
      +        VirtualChannel virtualChannel = launcher.getChannel();
      +        if (virtualChannel instanceof Channel) {
      +            @SuppressWarnings("resource")
      +            Channel channel = (Channel) virtualChannel;
      +            String remoteExportTable = channel.call(new DumpExportTableCallable());
      +            if (remoteExportTable.contains("object=java.lang.UNIXProcess$ProcessPipeInputStream")) {
      +                System.out.println();
      +                System.out.println(">>> local ExportTable:");
      +                StringWriter sw = new StringWriter();
      +                channel.dumpExportTable(new PrintWriter(sw));
      +                System.out.println(sw.toString());
      +                System.out.println();
      +                System.out.println(">>> remote ExportTable:");
      +                System.out.println(remoteExportTable);
      +                System.out.println();
      +                fail("remote ExportTable contains some java.lang.UNIXProcess$ProcessPipeInputStream objects");
      +            }
      +        }
      +    }
      +
      +    private static final class DumpExportTableCallable extends MasterToSlaveCallable<String, IOException> {
      +        private static final long serialVersionUID = 1L;
      +
      +        @Override
      +        public String call() throws IOException {
      +            StringWriter sw = new StringWriter();
      +            PrintWriter pw = new PrintWriter(sw);
      +            Channel channel = getChannelOrFail();
      +            channel.dumpExportTable(pw);
      +            return sw.toString();
      +        }
      +    }
      +
       }
      

      (let me know if it would be helpful to make a PR for this)

      Running this modified smokeTest gives something like the attached output file.

      Comparing the way Launcher is used in durable-task to run the RemoteProc with what is done "elsewhere" (like hudson.tasks.CommandInterpreter) made me suspect the cause was the "reverse" mode for stdout/stderr, in use here:

      https://github.com/jenkinsci/durable-task-plugin/blob/durable-task-1.33/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L231

              LOGGER.log(Level.FINE, "launching {0}", launcherCmd);
              Launcher.ProcStarter ps = launcher.launch().cmds(launcherCmd).envs(escape(envVars)).pwd(ws).quiet(true);
              if (LAUNCH_DIAGNOSTICS) {
                  ps.stdout(listener);
              } else {
                  ps.readStdout().readStderr(); // TODO RemoteLauncher.launch fails to check ps.stdout == NULL_OUTPUT_STREAM, so it creates a useless thread even if you never called stdout(…)
              }
              ps.start();
      

      FTR, the "avoid useless thread" trick was introduced by Jesse Glick a long time ago, so this issue is probably nothing new:
      https://github.com/jenkinsci/durable-task-plugin/commit/9c8814e077bfd5d288c2db105f499ef6c21b4e5f

      So I've tried my unit test with org.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true, and this time it works fine.
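      (For reference, since LAUNCH_DIAGNOSTICS is read as a JVM system property, it can be enabled on the relevant JVM with a flag like the following:)

      ```
      -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true
      ```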

      And that's it, thanks for reading

      I have no idea how to fix this... I've not been able to find a way to work around the issue in durable-task (without reintroducing the unwanted StreamCopyThread), and I've not looked at what would be required in core to better handle this use case.

        Attachments

          Activity

          jglick Jesse Glick added a comment -

          Thank you for the detailed investigation! I am also not sure offhand how to fix this, though it smells like a core bug.

          tom_gl Thomas de Grenier de Latour added a comment -

          Actually, I've found a fix/workaround which could be implemented in durable-task plugin:
          https://github.com/jenkinsci/durable-task-plugin/pull/120
          Apparently, closing the unused streams is enough to release their remote counterpart (more details in the PR).
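          The JDK-level analogue of that workaround can be sketched as follows (this uses the plain java.lang.Process API for illustration; the PR itself closes the streams on the Jenkins-side Proc after launch):

          ```java
          public class CloseStreams {
              public static void main(String[] args) throws Exception {
                  Process p = new ProcessBuilder("/bin/true").start();
                  p.waitFor();
                  // Closing the buffered pipe streams releases their internal 8 KB
                  // buffers, so a lingering reference no longer pins much memory.
                  p.getInputStream().close();
                  p.getErrorStream().close();
                  System.out.println("streams closed, exit=" + p.exitValue());
              }
          }
          ```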

          doubleknd26 Kideok Kim added a comment -

          I'm suffering from the same issue right now and hope it will be merged ASAP.

          kon Kalle Niemitalo added a comment -

          The PR was merged and released in durable-task-1.34, but I don't know whether more changes are needed elsewhere.

          doubleknd26 Kideok Kim added a comment -

          I updated to durable-task-1.34 and almost all slaves are back to normal after the memory leak. Thanks!

          But a few slaves are still suffering from this issue. I'd appreciate it if you could share any ideas.


            People

            • Assignee:
              Unassigned
            • Reporter:
              tom_gl Thomas de Grenier de Latour
            • Votes:
              0
            • Watchers:
              4
