Jenkins / JENKINS-57119

maven-plugin random socket leak, leading to thread leaks on slave and master

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Component/s: maven-plugin

      Description

      I've had to deal with a leak of "Channel reader thread: Channel to Maven" threads for a few months now. On the master, these appear as follows, stuck on a read() operation, without the attached "Executor" thread that exists while a job is running:

      {code:java}
      "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
              - locked <0x0000000607135d08> (a [B)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              - locked <0x0000000607062548> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      {code}

      After about three weeks we accumulate between 3,000 and 4,000 of these threads, eventually leading to OutOfMemoryError or file-descriptor exhaustion.

      I found no related error in the master logs, the slave logs, or the output of any job.

      After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation, with the following stack:

      {code:java}
      "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.FilterInputStream.read(FilterInputStream.java:107)
              at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
      {code}

      When this happens, the socket that the thread tries to read() from is stuck indefinitely in the FIN_WAIT2 state, with the other end of the connection (maven-interceptor) gone:

      {code:java}
      tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java
      {code}

      This seems to occur at the end of a Maven job execution, when the socket is being close()d. I have not been able to reproduce the state deliberately, but a few instances appear per hour in our environment. I captured a tcpdump trace and noticed the following patterns:

      Most common closure pattern, without a leak (slave: 46415 | maven-interceptor: 59114): see attachment normal-no-leak.png

      RST without a leak (slave: 34531 | maven-interceptor: 47034): see attachment rst-no-leak.png

      RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492): see attachment rst-leak.png

      This looks like a race condition that depends on which side attempts to close the connection first. The strange part is that the read() in the StreamCopyThread never returns -1, as it should when a FIN is received, nor throws the SocketException expected for the RST packet.
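      As a point of contrast, here is a minimal, self-contained Java sketch (illustration only, not plugin code; class and method names are mine) of the behavior an orderly shutdown should produce: once the peer's FIN arrives and the stream is drained, a blocking read() returns -1.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class FinReadDemo {
    // Connects a socket pair over loopback, closes the peer side (sending a
    // FIN), then performs a blocking read() on the surviving side.
    public static int readAfterPeerClose() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket peer = new Socket("127.0.0.1", server.getLocalPort());
             Socket local = server.accept()) {
            peer.close();                         // peer sends FIN
            return local.getInputStream().read(); // orderly EOF: returns -1
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterPeerClose()); // prints -1
    }
}
```

      In the leaked case described above, this read() never returns at all, even though the peer process is gone.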

      Adding a socket timeout:

      I noticed that there is no timeout set on the socket created in the remoting process. I have had similar issues in the past with stateful network devices dropping connections without notifying either end, and setting a socket timeout usually helps in such cases.

      However, it is not a good option here: every consumer of the socket would need to handle SocketTimeoutException, and it is hard to track them all down. Judging from the test suite, the StreamCopyThread is not the only reader that would need to handle it.
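      A minimal sketch of why this option is invasive (illustration only, names are mine): once setSoTimeout() is set, any blocking read() on the socket can throw SocketTimeoutException, so every reader of the stream has to be prepared for it.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    // Demonstrates that a read timeout turns an idle-but-healthy connection
    // into a SocketTimeoutException that the reader must handle.
    public static boolean timesOut() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket peer = new Socket("127.0.0.1", server.getLocalPort());
             Socket local = server.accept()) {
            local.setSoTimeout(100); // any blocking read now times out after 100 ms
            try {
                local.getInputStream().read(); // peer sends nothing
                return false;
            } catch (SocketTimeoutException e) {
                return true; // every consumer of the stream needs this branch
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(timesOut()); // prints true
    }
}
```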

      Enabling socket keepalive:

      The next option was to enable SO_KEEPALIVE on the socket and let the network stack kill the connection once it detects that the remote end no longer answers. This should not impact the normal reading flow; it only causes a SocketException when a socket is stuck in the FIN_WAIT2 state. That exception is already ignored in the StreamCopyThread: https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74
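      The idea can be sketched as follows (the helper is hypothetical, not the actual plugin code): enable SO_KEEPALIVE on the accepted socket, so that when the remote end silently disappears, the kernel's keepalive probes eventually fail and the blocked read() gets a SocketException instead of hanging forever.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveSketch {
    // Hypothetical helper: accept a connection and arm TCP keepalive on it.
    public static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
        Socket s = server.accept();
        s.setKeepAlive(true); // kernel probes dead peers per the tcp_keepalive_* sysctls
        return s;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = acceptWithKeepAlive(server)) {
            System.out.println(accepted.getKeepAlive()); // prints true
        }
    }
}
```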

      I'm currently running a patched version with this fix and it seems to work properly. While it doesn't prevent the leak itself, leaked threads no longer live forever on either the slave or the master. We can also control the behavior by adjusting kernel parameters on the slaves:

       

      {code:java}
      ~# ss -o | grep FIN
      tcp    FIN-WAIT-2 0      0      ::ffff:127.0.0.1:39615      ::ffff:127.0.0.1:34032      timer:(keepalive,83min,0)

      ~# sysctl -a | grep keepalive
      net.ipv4.tcp_keepalive_intvl = 75
      net.ipv4.tcp_keepalive_probes = 9
      net.ipv4.tcp_keepalive_time = 7200
      {code}
       

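      For reference, with those default values the kernel only declares a silent peer dead after tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl seconds of idleness, roughly two hours. A quick back-of-the-envelope (illustrative code, not part of the fix):

```java
public class KeepAliveBudget {
    // Seconds of idleness before an unresponsive peer is declared dead:
    // the idle threshold plus the full run of unanswered probes.
    public static int detectionSeconds(int keepaliveTime, int probes, int intvl) {
        return keepaliveTime + probes * intvl;
    }

    public static void main(String[] args) {
        // Default Linux values: 7200 + 9 * 75 = 7875 s, about 2 h 11 min
        // before the stuck FIN_WAIT2 socket is torn down and the reader
        // thread can finally exit.
        System.out.println(detectionSeconds(7200, 9, 75)); // prints 7875
    }
}
```

      Lowering these sysctls on the slaves shortens how long a leaked thread survives.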
       Here's a PR with the change: https://github.com/jenkinsci/maven-plugin/pull/126

       

        Attachments

        1. normal-no-leak.png (116 kB)
        2. rst-leak.png (145 kB)
        3. rst-no-leak.png (148 kB)

          Activity

          glavoie Gabriel Lavoie created issue -
          glavoie Gabriel Lavoie made changes -
          Attachment normal-no-leak.png [ 46839 ]
          Attachment rst-leak.png [ 46840 ]
          Attachment rst-no-leak.png [ 46841 ]
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern (no leak): 

          <Will add trace>

          Some of those (no leak):

          <Will add trace>

          Leading to a leak:

          <Will add trace>

           

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern (no leak): 

            !normal-no-leak.png|thumbnail!

          Some of those (no leak):

          !rst-no-leak.png|thumbnail!

          Leading to a leak:

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern (no leak): 

            !normal-no-leak.png|thumbnail!

          Some of those (no leak):

          !rst-no-leak.png|thumbnail!

          Leading to a leak:

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          This is not a good option here, however, as every user of the InputStream that wraps the socket would need to handle SocketTimeoutException, and it is hard to control all of those users. From running the test suite, the StreamCopyThread is not the only consumer that would need to handle it.
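For illustration, here is a minimal, self-contained sketch (not taken from the plugin code; class and method names are made up) of what SO_TIMEOUT does to a blocked read(): the silent hang becomes a SocketTimeoutException that every consumer of the stream would have to catch:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    // Returns "timed out" when the blocked read() is interrupted by SO_TIMEOUT.
    static String attemptRead() throws Exception {
        // Loopback socket pair standing in for the slave <-> maven-interceptor link.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(200); // 200 ms read timeout
            InputStream in = client.getInputStream();
            try {
                int b = in.read(); // the peer never writes, so this cannot return normally
                return "read returned " + b;
            } catch (SocketTimeoutException e) {
                // Every user of the wrapping InputStream would now have to
                // handle this exception, which is why the option was rejected.
                return "timed out";
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attemptRead());
    }
}
```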

          +Enabling socket keepalive:+

          The next option was to enable the socket's SO_KEEPALIVE flag and let the network stack kill the socket when it detects that the remote end no longer answers. This shouldn't impact the normal reading flow; it only causes a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]
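As a sketch of the approach (assumption: this mirrors the spirit of the change, not the exact patch in the PR), enabling keepalive is a one-line socket option, after which the kernel takes over probing the idle peer:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    // Enables SO_KEEPALIVE on a socket and reports the resulting flag.
    static boolean enableKeepAlive(Socket socket) throws Exception {
        // With SO_KEEPALIVE set, the kernel periodically probes an idle peer;
        // if the remote end is gone (e.g. the socket is stuck in FIN_WAIT2),
        // the probes eventually fail and the blocked read() receives a
        // SocketException instead of hanging forever.
        socket.setKeepAlive(true);
        return socket.getKeepAlive();
    }

    public static void main(String[] args) throws Exception {
        // Loopback socket used purely for demonstration.
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            System.out.println("keepalive=" + enableKeepAlive(socket));
        }
    }
}
```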

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't prevent the leak itself, leaked threads no longer live forever on either the slave or the master. The behavior can also be tuned by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
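With the default parameters above, the worst-case lifetime of a leaked socket is tcp_keepalive_time plus tcp_keepalive_probes probes spaced tcp_keepalive_intvl seconds apart. A tiny helper (illustrative only, not part of the patch) makes the budget explicit:

```java
public class KeepAliveBudget {
    // Worst-case seconds before the kernel reaps a dead keepalive socket:
    // idle time before the first probe, plus all probes failing in sequence.
    static int worstCaseSeconds(int keepaliveTime, int probes, int intvl) {
        return keepaliveTime + probes * intvl;
    }

    public static void main(String[] args) {
        // Defaults shown above: 7200 + 9 * 75 = 7875 s, roughly 2h11m.
        System.out.println(worstCaseSeconds(7200, 9, 75));
    }
}
```

Lowering these sysctls on the slaves shortens how long a leaked FIN_WAIT2 socket (and its reader threads) can survive.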
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.x_-_Cog_Saucesomeness/3.3.x_-_With_Cog, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state willingly, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
{code}

This seems to occur at the end of a Maven job execution, when the socket is closed. I have not been able to reproduce this state on demand, but a few occurrences appear per hour in our environment. I captured some traffic with tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

This looks like a race condition depending on which side attempts to close the connection first. The strange part is that the read() operation in the StreamCopyThread never returns "-1", which should occur when a FIN is received, nor throws the SocketException expected for the RST packet.
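For contrast, this is the behavior the reader thread normally relies on: after an orderly FIN/ACK close by the peer, a blocked read() unblocks with end-of-stream. A minimal, self-contained sketch over loopback (not the plugin's code; class name is made up for illustration):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class OrderlyCloseDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();

            accepted.close();              // peer sends FIN

            try (InputStream in = client.getInputStream()) {
                int r = in.read();         // unblocks with end-of-stream
                System.out.println("read() returned: " + r);  // prints -1
            }
            client.close();
        }
    }
}
```

The leak described above is precisely the case where this -1 (or the SocketException for an RST) never materializes.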

          +Adding socket timeout:+

I noticed that no timeout is set on the socket created in the remoting process. I've had similar issues in the past with stateful network devices dropping connections without notifying either end, and setting a timeout usually helps in such cases.

This is not a good option here, as every user of the InputStream that wraps the socket would need to handle SocketTimeoutException, and it is hard to track down all of those users. From running the test suite, it's not just the StreamCopyThread that would need to handle it.
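To illustrate the burden: once setSoTimeout() is in place, every consumer of the stream needs a SocketTimeoutException branch. A minimal sketch (hypothetical class name, not the remoting code):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {   // peer connected but silent
            client.setSoTimeout(500);           // 500 ms read timeout

            InputStream in = client.getInputStream();
            try {
                in.read();                      // no data ever arrives
            } catch (SocketTimeoutException e) {
                // Every reader of the stream needs a branch like this one,
                // deciding whether to retry or give up.
                System.out.println("read timed out, socket still open: "
                        + !client.isClosed());
            }
        }
    }
}
```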

          +Enabling socket keepalive:+

The next option was to enable SO_KEEPALIVE and let the network stack kill the socket when it detects that the remote end no longer answers. This shouldn't impact the normal read flow; it only causes a SocketException when a socket gets stuck in the FIN_WAIT2 state, and that exception is already ignored by the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]
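A minimal sketch of the idea (SO_KEEPALIVE is off by default on Java sockets; the actual fix lives in the plugin's socket setup, this is just an illustration with a made-up class name):

```java
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            // Off by default, so a FIN_WAIT2 socket is never probed.
            System.out.println("default keepalive: " + socket.getKeepAlive());

            // With keepalive on, the kernel eventually probes the dead peer
            // and resets the connection, failing the blocked read().
            socket.setKeepAlive(true);
            System.out.println("after change:      " + socket.getKeepAlive());
        }
    }
}
```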

I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak from occurring, leaked threads no longer live forever on either the slave or the master. We can also control how quickly they are reclaimed by adjusting kernel parameters on the slaves:

           
{code:bash}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
{code}

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]
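For reference, with the default kernel settings shown above, the keepalive mechanism should reclaim a dead FIN_WAIT2 socket after roughly tcp_keepalive_time plus tcp_keepalive_probes × tcp_keepalive_intvl seconds (assumed standard Linux semantics):

```java
public class KeepAliveBudget {
    public static void main(String[] args) {
        int time = 7200, intvl = 75, probes = 9;   // defaults from sysctl above
        int totalSeconds = time + probes * intvl;  // 7200 + 675 = 7875 s
        System.out.printf("leaked socket reclaimed after ~%d min%n",
                totalSeconds / 60);                // ~131 min
    }
}
```

That is consistent with the ss output above, where the stuck socket shows a keepalive timer counting down rather than sitting idle forever.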

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured some tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as users of the InputStream that wrap the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured a tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation in the StreamCopyThread never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as all the users of the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          glavoie Gabriel Lavoie made changes -
          Description I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
           

           

          After about 3 weeks, we get between 3000-4000 of those, eventually leading to OutOfMemory or file descriptor related errors. 

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

          This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few appear in per hour in our environment. I captured a tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

          This looks like a race condition depending on which side first attempt to close the connection. The strange part is that the read() operation in the StreamCopyThread never returns with "-1", which should occur when a FIN is received, nor the SocketException for the RST packet.

          +Adding socket timeout:+

          I noticed that there is no timeout set on the socket created in the remoting process. Had similar issues in the past with stateful network devices dropping connection without notifying both ends and this usually helps in such cases.

          Not a good option here as all the users of the socket would need to handle SocketTimeoutException, and it seems that it's hard to control all the users. From running the test suite, it's not just the StreamCopyThread that needs to handle it.

          +Enabling socket keepalive:+

          Next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end doesn't answer. This shouldn't impact the normal reading flow, only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. This exception is currently ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]

          I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads do not live forever anymore on both the slave and the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
          I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

           
          {code:java}
          "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
                  - locked <0x0000000607135d08> (a [B)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                  - locked <0x0000000607062548> (a java.io.BufferedInputStream)
                  at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
                  at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
                  at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
                  at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          {code}
            

After about 3 weeks, we accumulate between 3000 and 4000 of those, eventually leading to OutOfMemoryError or file-descriptor exhaustion errors.

          I found no error in the master logs, slave logs or from any job that would relate to this.

          After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

           
          {code:java}
          "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                  at java.net.SocketInputStream.read(SocketInputStream.java:171)
                  at java.net.SocketInputStream.read(SocketInputStream.java:141)
                  at java.io.FilterInputStream.read(FilterInputStream.java:133)
                  at java.io.FilterInputStream.read(FilterInputStream.java:107)
                  at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
          {code}
           

           

          When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

           
          {code:java}
          tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
          {code}
           

This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state voluntarily, but a few occurrences appear per hour in our environment. I captured a tcpdump and noticed the following patterns:

          Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): 

            !normal-no-leak.png|thumbnail!

          RST without leak (slave: 34531 | maven-interceptor: 47034):

          !rst-no-leak.png|thumbnail!

          RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):

          !rst-leak.png|thumbnail!  

This looks like a race condition depending on which side attempts to close the connection first. The strange part is that the read() operation in the StreamCopyThread never returns -1, as it should when a FIN is received, nor does it throw the SocketException expected for the RST packet.
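For reference, this is the signal the reader normally gets on an orderly close: the peer's FIN surfaces as end-of-stream (-1) on the blocked read(). A minimal, self-contained sketch (class name is illustrative, using a loopback connection like the plugin does):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class FinDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            try (Socket accepted = server.accept()) {
                // Orderly shutdown: closing the peer sends a FIN, and the
                // read() on the other side returns -1 (end of stream).
                client.close();
                InputStream in = accepted.getInputStream();
                System.out.println("read() returned " + in.read());
            }
        }
    }
}
```

In the leaked case neither this -1 nor a SocketException ever arrives, so the thread blocks forever.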

          +Adding socket timeout:+

I noticed that no timeout is set on the socket created in the remoting process. I've had similar issues in the past with stateful network devices dropping connections without notifying either end, and setting a timeout usually helps in such cases.

It is not a good option here, though, as every user of the socket would need to handle SocketTimeoutException, and it seems hard to control them all. From running the test suite, it's not just the StreamCopyThread that would need to handle it.
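To illustrate why this was ruled out, here is a rough sketch of the timeout approach (class name and timeout value are illustrative): once setSoTimeout() is in place, any read() that stays idle past the deadline throws, and every reader sharing the socket has to catch it.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            // A read timeout turns an idle blocking read() into a
            // SocketTimeoutException after the given number of milliseconds...
            accepted.setSoTimeout(200);
            InputStream in = accepted.getInputStream();
            try {
                in.read(); // nothing was written, so this times out
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                // ...which every user of the socket would have to handle,
                // typically by distinguishing "idle" from "dead" and retrying.
                System.out.println("SocketTimeoutException");
            }
        }
    }
}
```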

          +Enabling socket keepalive:+

Next option was to enable SO_KEEPALIVE on the socket and let the network stack kill the connection when it detects that the remote end no longer answers. This shouldn't impact the normal reading flow; it only causes a SocketException when a socket gets stuck in the FIN_WAIT2 state. That exception is already ignored in the StreamCopyThread: [https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74]
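A minimal sketch of the idea (not the exact PR change; the socket setup in the plugin lives elsewhere, and the class name here is illustrative): enabling keep-alive on the accepted socket is a single setKeepAlive(true) call.

```java
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            // Enable TCP keep-alive: the kernel probes the peer after
            // net.ipv4.tcp_keepalive_time seconds of inactivity and
            // eventually fails a blocked read() with an IOException
            // if the peer is gone (e.g. a socket stuck in FIN_WAIT2).
            accepted.setKeepAlive(true);
            System.out.println("keepAlive=" + accepted.getKeepAlive());
        }
    }
}
```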

I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't prevent the leak, leaked threads no longer live forever on either the slave or the master. We can also control the behavior by adjusting kernel parameters on the slaves:

           
          {code:java}
          ~# ss -o | grep FIN
          tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)

          ~# sysctl -a | grep keepalive
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
{code}
           

           Here's a PR with the change: [https://github.com/jenkinsci/maven-plugin/pull/125]

           
glavoie Gabriel Lavoie made changes -
Description (edited; the PR link was updated to [https://github.com/jenkinsci/maven-plugin/pull/126])
          jglick Jesse Glick made changes -
          Assignee Gabriel Lavoie [ glavoie ]
          jglick Jesse Glick made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          jglick Jesse Glick made changes -
          Status In Progress [ 3 ] In Review [ 10005 ]
          glavoie Gabriel Lavoie added a comment -

          Adding more information about TCP KeepAlive: https://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/

          From HOWTO:

          Keepalive is non-invasive, and in most cases, if you're in doubt, you can turn it on without the risk of doing something wrong. But do remember that it generates extra network traffic, which can have an impact on routers and firewalls.

          In short, use your brain and be careful.

          In the next section we will distinguish between the two target tasks for keepalive:

          • Checking for dead peers
          • Preventing disconnection due to network inactivity

           

To add some explanation on the configuration: when keep-alive is enabled on a socket, the Linux network stack starts sending empty keep-alive probe packets after 2 hours of inactivity (net.ipv4.tcp_keepalive_time), repeated at a regular interval (net.ipv4.tcp_keepalive_intvl), each expecting an ACK in return. If the peer doesn't answer after a given number of attempts (net.ipv4.tcp_keepalive_probes), or a RST is received, the network stack knows the connection is dead and reports it to the application (IOException).

We've been testing this change for a few weeks now and see no adverse effect, as those threads seem to be linked only to the slave, mostly associated with jobs that completed successfully.

          olamy Olivier Lamy made changes -
          Labels maven-plugin-3.3
          olamy Olivier Lamy made changes -
          Status In Review [ 10005 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          olamy Olivier Lamy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              glavoie Gabriel Lavoie
              Reporter:
              glavoie Gabriel Lavoie
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: