Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-60960

Agent-side memory leak (ExportTable retaining some UNIXProcess$ProcessPipeInputStream)

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Major Major
    • core, durable-task-plugin
    • None
    • Jenkins 2.204.2
      durable-task 1.33
      Linux, OpenJDK 8

      I came across some OOM errors on some agents which were online for 2-3 weeks.

      Heap dumps showed thousands of of java.lang.UNIXProcess$ProcessPipeInputStream objects retained by the remoting ExportTable.  There was no zombie process on the agent VM, and all these ProcessPipeInputStream were linked to the ProcessBuilder.NullInputStream.INSTANCE, which also suggests they were for processes which had already exited, at least from what I understand of this code:
      http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/tip/src/solaris/classes/java/lang/UNIXProcess.java#l525

      Because these input streams are also BufferedInputStream, each of them retains an 8KB byte array, thus a significant amount of memory when you have thousands of them.

      In the dumps ExportTable$CreatedAt of these entries, the allocationTrace was, unfortunately, not super helpful to identify the root cause (but maybe it's just me looking at stuff I don't know):

      hudson.remoting.ExportTable$Entry.<init>() (ExportTable.java:114)
      hudson.remoting.ExportTable.export() (ExportTable.java:365)
      hudson.remoting.Channel.internalExport() (Channel.java:776)
      hudson.remoting.RemoteInputStream.writeObject() (RemoteInputStream.java:174)
      sun.reflect.NativeMethodAccessorImpl.invoke0() (NativeMethodAccessorImpl.java)
      sun.reflect.NativeMethodAccessorImpl.invoke() (NativeMethodAccessorImpl.java:62)
      sun.reflect.DelegatingMethodAccessorImpl.invoke() (DelegatingMethodAccessorImpl.java:43)
      java.lang.reflect.Method.invoke() (Method.java:498)
      java.io.ObjectStreamClass.invokeWriteObject() (ObjectStreamClass.java:1140)
      java.io.ObjectOutputStream.writeSerialData() (ObjectOutputStream.java:1496)
      java.io.ObjectOutputStream.writeOrdinaryObject() (ObjectOutputStream.java:1432)
      java.io.ObjectOutputStream.writeObject0() (ObjectOutputStream.java:1178)
      java.io.ObjectOutputStream.defaultWriteFields() (ObjectOutputStream.java:1548)
      java.io.ObjectOutputStream.writeSerialData() (ObjectOutputStream.java:1509)
      java.io.ObjectOutputStream.writeOrdinaryObject() (ObjectOutputStream.java:1432)
      java.io.ObjectOutputStream.writeObject0() (ObjectOutputStream.java:1178)
      java.io.ObjectOutputStream.writeObject() (ObjectOutputStream.java:348)
      hudson.remoting.UserRequest._serialize() (UserRequest.java:263)
      hudson.remoting.UserRequest.serialize() (UserRequest.java:272)
      hudson.remoting.UserRequest.perform() (UserRequest.java:222)
      hudson.remoting.UserRequest.perform() (UserRequest.java:54)
      hudson.remoting.Request$2.run() (Request.java:369)
      hudson.remoting.InterceptingExecutorService$1.call() (InterceptingExecutorService.java:72)
      java.util.concurrent.FutureTask.run() (FutureTask.java:266)
      java.util.concurrent.ThreadPoolExecutor.runWorker() (ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run() (ThreadPoolExecutor.java:624)
      java.lang.Thread.run() (Thread.java:748)
      

      So, I've compared some heap dumps over short periods of time on an agent, correlating the delta with the builds activity, and after trimming down a particular Pipeline which was contributing to the leak I came to the conclusion that this is a reliable way to leak two java.lang.UNIXProcess$ProcessPipeInputStream:

      node {
          stage("test") {
              sh '/bin/true'
          }
      }
      

      I've also tried random incantations, like forcing GC on master and agent, but it looks like once these objects have been exported they really won't get unexported. And of course, running a freestyle job with a similar Shell step does not trigger a similar issue (the agent's ExportTable is exactly the same before/after running the job).

      Because it was looking like a durable-task bug at first glance, I've tried to reproduce it from a unit test. So far, I've been plugging this in the BourneShellScriptTest.smokeTest:

      diff --git a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java b/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      index 85a8ca9..b16e82a 100644
      --- a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      +++ b/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
      @@ -37,15 +37,20 @@ import hudson.Platform;
       import hudson.Proc;
       import hudson.model.Slave;
       import hudson.plugins.sshslaves.SSHLauncher;
      +import hudson.remoting.Channel;
       import hudson.remoting.VirtualChannel;
       import hudson.slaves.DumbSlave;
       import hudson.tasks.Shell;
       import hudson.util.StreamTaskListener;
       import hudson.util.VersionNumber;
      +import jenkins.security.MasterToSlaveCallable;
      +
       import java.io.ByteArrayOutputStream;
       import java.io.File;
       import java.io.IOException;
       import java.io.InputStream;
      +import java.io.PrintWriter;
      +import java.io.StringWriter;
       import java.nio.charset.Charset;
       import java.nio.charset.StandardCharsets;
       import java.time.Instant;
      @@ -217,6 +222,7 @@ public class BourneShellScriptTest {
                       throw new AssertionError(platform);
               }
       
      +        assertNoProcessPipeInputStreamInRemoreExportTable();
               String script = String.format("echo hello world; sleep %s", sleepSeconds);
               Controller c = new BourneShellScript(script).launch(new EnvVars(), ws, launcher, listener);
               awaitCompletion(c);
      @@ -225,7 +231,9 @@ public class BourneShellScriptTest {
               assertEquals(0, c.exitStatus(ws, launcher, listener).intValue());
               assertTrue(baos.toString().contains("hello world"));
               c.cleanup(ws);
      -        assertNoZombies();
      +        //assertNoZombies();
      +        Thread.sleep(5_000); // give it some time to unexport other stuff
      +        assertNoProcessPipeInputStreamInRemoreExportTable();
           }
       
           @Test public void stop() throws Exception {
      @@ -595,4 +603,40 @@ public class BourneShellScriptTest {
                   Thread.sleep(100);
               }
           }
      +
      +    private void assertNoProcessPipeInputStreamInRemoreExportTable()
      +            throws IOException, InterruptedException {
      +        VirtualChannel virtualChannel = launcher.getChannel();
      +        if (virtualChannel instanceof Channel) {
      +            @SuppressWarnings("resource")
      +            Channel channel = (Channel) virtualChannel;
      +            String remoteExportTable = channel.call(new DumpExportTableCallable());
      +            if (remoteExportTable.contains("object=java.lang.UNIXProcess$ProcessPipeInputStream")) {
      +                System.out.println();
      +                System.out.println(">>> local ExportTable:");
      +                StringWriter sw = new StringWriter();
      +                channel.dumpExportTable(new PrintWriter(sw));
      +                System.out.println(sw.toString());
      +                System.out.println();
      +                System.out.println(">>> remote ExportTable:");
      +                System.out.println(remoteExportTable);
      +                System.out.println();
      +                fail("remote ExportTable contains some java.lang.UNIXProcess$ProcessPipeInputStream objects");
      +            }
      +        }
      +    }
      +
      +    private static final class DumpExportTableCallable extends MasterToSlaveCallable<String, IOException> {
      +        private static final long serialVersionUID = 1L;
      +
      +        @Override
      +        public String call() throws IOException {
      +            StringWriter sw = new StringWriter();
      +            PrintWriter pw = new PrintWriter(sw);
      +            Channel channel = getChannelOrFail();
      +            channel.dumpExportTable(pw);
      +            return sw.toString();
      +        }
      +    }
      +
       }
      

      (let me know if it would be helpful to make a PR for this)

      Running this modified smokeTest gives something like the attached output file.

      Comparing the way Launcher is used in durable-task to run the RemoteProc with what is done "elsewhere" (like hudson.tasks.CommandInterpreter) made me suspect the cause was the "reverse" mode for stdout/stderr, in use here:

      https://github.com/jenkinsci/durable-task-plugin/blob/durable-task-1.33/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L231

              LOGGER.log(Level.FINE, "launching {0}", launcherCmd);
              Launcher.ProcStarter ps = launcher.launch().cmds(launcherCmd).envs(escape(envVars)).pwd(ws).quiet(true);
              if (LAUNCH_DIAGNOSTICS) {
                  ps.stdout(listener);
              } else {
                  ps.readStdout().readStderr(); // TODO RemoteLauncher.launch fails to check ps.stdout == NULL_OUTPUT_STREAM, so it creates a useless thread even if you never called stdout(…)
              }
              ps.start();
      

      FTR, the "avoid useless thread" trick had been introduced by jglick long time ago, so this issue is probably nothing new:
      https://github.com/jenkinsci/durable-task-plugin/commit/9c8814e077bfd5d288c2db105f499ef6c21b4e5f

      So I've tried my unit test with org.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true, and this time it works fine.

      And that's it, thanks for reading

      I have no idea how to fix this... I've not been able to find a way to work around the issue in durable-task (without reintroducing the unwanted StreamCopyThread), and I've not looked at what would be required in core to better handle this use case.

            Unassigned Unassigned
            tom_gl Thomas de Grenier de Latour
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated: