JENKINS-25218

Channel hangs due to the infinite loop in FifoBuffer within the lock

      Description

      I noticed the following "deadlock" that prevents NioChannelHub from serving any channels, which breaks all the slaves.

      NioChannelHub thread is blocked:
      
          "NioChannelHub keys=2 gen=185197: Computer.threadPoolForRemoting [#3]" daemon prio=10 tid=0x00007f872c021800 nid=0x1585 waiting for monitor entry [0x00007f86ce2ba000]
             java.lang.Thread.State: BLOCKED (on object monitor)
      	    at hudson.remoting.Channel.terminate(Channel.java:792)
      	    - waiting to lock <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Channel$2.terminate(Channel.java:483)
      	    at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:203)
      	    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:597)
      	    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      
      ... because of this guy:
      
          "Computer.threadPoolForRemoting [#216] for mac" daemon prio=10 tid=0x00007f86dc0d6800 nid=0x3f34 in Object.wait() [0x00007f87442f1000]
             java.lang.Thread.State: WAITING (on object monitor)
      	    at java.lang.Object.wait(Native Method)
      	    - waiting on <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at java.lang.Object.wait(Object.java:485)
      	    at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
      	    - locked <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:215)
      	    at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
      	    at hudson.remoting.Channel.send(Channel.java:545)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Request$2.run(Request.java:342)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      

      Full thread dump is here
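
      A minimal, hypothetical sketch of this hold-and-wait pattern (illustrative names only, not the actual remoting code): the writer holds the Channel monitor while waiting for buffer space, and the hub thread that would drain the buffer or mark it closed is itself blocked waiting for that same monitor, so neither side can make progress.

          // Hypothetical sketch of the pattern in the dumps above; running it hangs
          // forever, which is exactly the reported behaviour.
          public class ChannelHangSketch {
              static final Object channelMonitor = new Object(); // stands in for the hudson.remoting.Channel monitor
              static final Object bufferLock = new Object();     // stands in for the FifoBuffer lock
              static int writable = 0;                           // 0 == buffer is full
              static boolean closed = false;

              public static void main(String[] args) throws InterruptedException {
                  Thread writer = new Thread(() -> {
                      synchronized (channelMonitor) {            // Channel.send(...)
                          synchronized (bufferLock) {            // FifoBuffer.write(...)
                              try {
                                  while (writable == 0 && !closed)
                                      bufferLock.wait(100);      // releases bufferLock, but NOT channelMonitor
                              } catch (InterruptedException ignored) { }
                          }
                      }
                  }, "Computer.threadPoolForRemoting (writer)");

                  Thread hub = new Thread(() -> {
                      synchronized (channelMonitor) {            // Channel.terminate(...): BLOCKED forever
                          synchronized (bufferLock) {
                              closed = true;                     // never reached, so the writer never leaves its loop
                              bufferLock.notifyAll();
                          }
                      }
                  }, "NioChannelHub (hub)");

                  writer.start();
                  Thread.sleep(500);                             // let the writer take both monitors first
                  hub.start();
                  writer.join();                                 // never returns
              }
          }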

            Activity

            stephenconnolly Stephen Connolly added a comment -

            Also seeing similar live-locks with the following thread holding the lock:

            "Computer.threadPoolForRemoting [#173620] : IO ID=7851 : seq#=7850" #2080996 daemon prio=5 os_prio=0 tid=0x0000000088306800 nid=0x15e0 in Object.wait() [0x00000000f209f000]
               java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                at java.lang.Object.wait(Object.java:502)
                at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
                - locked <0x000000044d3408b8> (a org.jenkinsci.remoting.nio.FifoBuffer)
                at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:220)
                at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
                at hudson.remoting.Channel.send(Channel.java:576)
                - locked <0x000000044d3407a0> (a hudson.remoting.Channel)
                at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:260)
                at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
                at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
            

            The NioChannelHub thread has the same stack trace (modulo line numbers)

            oleg_nenashev Oleg Nenashev added a comment -

            I have found the cause of this issue in the FifoBuffer implementation. It happens when:

            1) FifoBuffer is full (sz == limit)
            2) We want to write a new message to the buffer
            3) So we get into this code: https://github.com/jenkinsci/remoting/blob/master/src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java#L342-L348

                synchronized (lock) {
                    while ((chunk = Math.min(len, writable())) == 0) {
                        if (closed)
                            throw new IOException("closed during write operation");

                        lock.wait(100);
                    }
                    // ... rest of write(): copy the chunk into the ring buffer ...
                }

            4) This is an infinite loop: it waits until there is space to write the data, but no space will ever be freed while the buffer stays full, because:

            • the receive() operation won't happen, since that code is synchronized on lock
            • we won't be able to close the channel either, because modification of the "closed" field is guarded by lock as well. Sad panda

            Fixing the issue right now
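
            A minimal, self-contained sketch of the hardening idea, assuming a hypothetical BoundedFifoSketch class rather than the real FifoBuffer (the actual change is in the pull requests referenced in the follow-up comments): write() gives up after a bounded time instead of waiting forever, and close() takes the same lock and wakes waiters, so a pending write observes "closed" promptly.

                import java.io.IOException;

                // Hypothetical sketch, not the actual remoting patch: a bounded FIFO whose
                // write() refuses to wait forever while the buffer stays full, so a writer
                // holding outer monitors (such as the Channel monitor) cannot pin them indefinitely.
                public class BoundedFifoSketch {
                    private final Object lock = new Object();
                    private final int capacity;
                    private int size;            // bytes currently stored ("sz" in FifoBuffer terms)
                    private boolean closed;

                    public BoundedFifoSketch(int capacity) { this.capacity = capacity; }

                    private int writable() { return capacity - size; }

                    /** Writes len bytes, but throws instead of spinning forever if no space appears within timeoutMs. */
                    public void write(byte[] buf, int start, int len, long timeoutMs)
                            throws IOException, InterruptedException {
                        long deadline = System.currentTimeMillis() + timeoutMs;
                        while (len > 0) {
                            int chunk;
                            synchronized (lock) {
                                while ((chunk = Math.min(len, writable())) == 0) {
                                    if (closed)
                                        throw new IOException("closed during write operation");
                                    long remaining = deadline - System.currentTimeMillis();
                                    if (remaining <= 0)
                                        throw new IOException("buffer stayed full for " + timeoutMs + "ms, aborting write");
                                    lock.wait(Math.min(remaining, 100));
                                }
                                size += chunk;      // a real buffer would copy the bytes into the ring here
                                lock.notifyAll();   // wake any reader waiting for data
                            }
                            start += chunk;
                            len -= chunk;
                        }
                    }

                    /** Marks the buffer closed and wakes blocked writers so they fail fast. */
                    public void close() {
                        synchronized (lock) {
                            closed = true;
                            lock.notifyAll();
                        }
                    }
                }

            In this sketch, either the bounded wait or a close() that does not require the outer Channel monitor lets a blocked writer fail and release its monitors, breaking the hold-and-wait cycle shown in the description.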

            oleg_nenashev Oleg Nenashev added a comment - edited

            A partial fix has been applied in JENKINS-32825, which was released in remoting 2.54. Jenkins cores should get this version starting from 1.651.x. But we're still at risk in particular corner cases.

            I'm working on additional fixes around this behavior.

            oleg_nenashev Oleg Nenashev added a comment -

            Created additional pull request: https://github.com/jenkinsci/remoting/pull/100

            oleg_nenashev Oleg Nenashev added a comment -

            Additional fixes have been integrated towards remoting-2.62.4 and remoting-3.3

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Oleg Nenashev
            Path:
            pom.xml
            http://jenkins-ci.org/commit/jenkins/ef588be4f264b5ba285110f472f031e2bd771c71
            Log:
            Update Jenkins remoting to 3.3 (#2671)

            • JENKINS-25218 - Hardening of FifoBuffer operation logic. The change improves the original fix in `remoting-2.54`.
            • JENKINS-39547 - Corrupt agent JAR cache causes agents to malfunction.

            Improvements:

            • JENKINS-40491 - Improve diagnostics of the preliminary FifoBuffer termination.
            • ProxyException now retains any suppressed exceptions.
            chris_phillips Chris Phillips added a comment - edited

            I have just been investigating a problem in our Jenkins setup that I think might be related to this. We're using the EC2 plugin and running builds that generate quite large logs (230 MB). At some point during the build, the master loses track of the log and just starts logging the same block of text from the log over and over for as long as I let it. The build completes successfully on the slave and nothing bad appears in the Node log in the Jenkins UI, but the master continues to fill up the filesystem with the same repeated text forever. I changed the build to log much less and now this isn't happening. We're running 2.46.2. Could this potentially be one of the edge cases? https://issues.jenkins-ci.org/browse/JENKINS-44483

            oleg_nenashev Oleg Nenashev added a comment -

            As we discussed in JENKINS-44483, it is likely a Pipeline-specific issue.

            Regarding this ticket, I am going to close it since extra patches have been applied. If somebody still sees it with remoting 3.5+, please let me know.


              People

              • Assignee: oleg_nenashev Oleg Nenashev
              • Reporter: kohsuke Kohsuke Kawaguchi
              • Votes: 7
              • Watchers: 13
