Jenkins / JENKINS-53569

Remoting deadlock observed after upgrading to 3.26


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: remoting, swarm-plugin
    • Labels:
      None
    • Environment:
      Server: Jenkins 2.138.1 LTS (Remoting 3.25)
      Client: Swarm Client 3.14 (Remoting 3.26)
    • Released As:
      Remoting 3.27, Jenkins 2.144

      Description

      After upgrading my Jenkins master and the Swarm Client to the latest stable versions, I am seeing a new deadlock on the Swarm Client side when trying to connect to the master.

      The relevant output from jstack:

      Found one Java-level deadlock:
      =============================
      "pool-1-thread-3":
        waiting to lock monitor 0x0000000000d12970 (object 0x0000000784a00fc0, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
        which is held by "Thread-2"
      "Thread-2":
        waiting for ownable synchronizer 0x0000000784a4ac68, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
        which is held by "pool-1-thread-3"
      
      Java stack information for the threads listed above:
      ===================================================
      "pool-1-thread-3":
              at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
              - waiting to lock <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:929)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:663)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:669)
              at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
              at hudson.remoting.Engine$1$$Lambda$5/613009671.run(Unknown Source)
              at java.lang.Thread.run(Thread.java:748)
      "Thread-2":
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x0000000784a4ac68> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.remove(ProtocolStack.java:755)
              at org.jenkinsci.remoting.protocol.FilterLayer.completed(FilterLayer.java:108)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.complete(ConnectionHeadersFilterLayer.java:363)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:499)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:692)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:179)
              at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
              at hudson.remoting.Engine.innerRun(Engine.java:573)
              at hudson.remoting.Engine.run(Engine.java:474)
      
      Found 1 deadlock.
      

      After encountering this deadlock, the Swarm Client never finishes connecting to the master. The master is unable to use the Swarm Client as a node when it reaches this hung state.
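The jstack report above is a classic lock-ordering inversion: one thread holds the ProtocolStack lock and wants the filter layer's monitor, while the other holds the monitor and wants the stack lock. As an illustration only (the class and lock names below are hypothetical stand-ins, not the actual Remoting code), the same shape can be reproduced in miniature and detected with the JVM's own deadlock detector, the same facility jstack uses:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {
    // Stand-ins for the two locks in the jstack output (names are illustrative):
    static final Object layerMonitor = new Object();  // the ConnectionHeadersFilterLayer monitor
    static final ReentrantReadWriteLock stackLock =
            new ReentrantReadWriteLock();             // the ProtocolStack read/write lock

    /** Provoke the inverted lock ordering; return how many threads the JVM reports deadlocked. */
    public static int provokeAndDetect() throws InterruptedException {
        CountDownLatch bothHoldOneLock = new CountDownLatch(2);

        // Like "pool-1-thread-3": holds the stack lock, then wants the layer monitor.
        Thread reader = new Thread(() -> {
            stackLock.writeLock().lock();
            bothHoldOneLock.countDown();
            awaitQuietly(bothHoldOneLock);
            synchronized (layerMonitor) { } // blocks forever: the other thread never releases it
        });
        // Like "Thread-2": holds the layer monitor, then wants the stack lock.
        Thread initer = new Thread(() -> {
            synchronized (layerMonitor) {
                bothHoldOneLock.countDown();
                awaitQuietly(bothHoldOneLock);
                stackLock.readLock().lock(); // parks forever: the write lock is never released
            }
        });
        reader.setDaemon(true); // daemon so the deadlocked threads don't keep the JVM alive
        initer.setDaemon(true);
        reader.start();
        initer.start();

        // Poll the JVM's deadlock detector (the same mechanism behind jstack's report).
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(100);
        }
        return 0; // not detected within ~10s
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("JVM reports " + provokeAndDetect() + " deadlocked threads");
    }
}
```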

        Attachments

        1. jstack.txt
          13 kB
        2. swarm-client.log
          191 kB
        3. swarm-client.stdout.log
          3 kB

          Activity

          basil Basil Crow added a comment -

          I've attached the full jstack output, the Swarm Client's standard out, and FINEST-level Swarm Client log files to the bug. Note that, unlike in a successful connection, "Connected" is never printed to standard out. Even though the deadlock happens in the Swarm Client, the stack trace implicates Remoting.

          basil Basil Crow added a comment -

          I looked through the recent commits and didn't find anything remotely related to locking and thread notifications besides JENKINS-51841. Could that change be related to this issue?

          jthompson Jeff Thompson added a comment -

          Basil Crow, it is unlikely that any of the recent changes caused the behavior you are seeing. The one you reference shouldn't have caused this, as it was a refactoring to give the remoting-kafka plugin access to some pieces.

          I don't have any insight into what might be going on in your system. I'm not familiar with any other reports like this. I'll try to take a little deeper look at your report when I get the chance.

          jthompson Jeff Thompson added a comment -

          I haven't had a chance to examine your report any further, but I ran across something elsewhere and wondered if it might be similar. From what I've read, JENKINS-42187 can apparently cause hangs relating to Docker and swarms. It sounds like your environment might be similar, so I thought I'd pass this along in case it helps.

          jthompson Jeff Thompson added a comment -

          No, it doesn't look like that Docker issue has anything to do with it. I got a little time to take a look at this and yes, it's a regular old Java threading deadlock. I'm not yet certain of the sequence that causes this deadlock here, or why it doesn't occur in other cases. I have an idea for a change that may solve the problem and doesn't seem to cause any other problems covered by the automated tests. Unfortunately, as usual, they don't cover threading, locking, and deadlocking very well.

          jthompson Jeff Thompson added a comment -

          Released Remoting 3.27, which contains a fix to avoid this deadlock. The potential deadlock has been around for a while and wasn't specific to 3.26; something may have changed the timing in some environments, making it occur more often. This should go into a weekly release soon.
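          The actual Remoting 3.27 patch is in the remoting repository; purely as a sketch of the usual way out of this kind of inversion (not the real change, and with hypothetical names), the completion callback can decide what to do while holding the layer's monitor but only touch the stack lock after releasing it, so the two locks are never held at once:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderFixSketch {
    // Hypothetical stand-ins for the layer monitor and the ProtocolStack lock.
    static final Object layerMonitor = new Object();
    static final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock();
    private static boolean completed; // guarded by layerMonitor

    /**
     * Decide under the monitor, act on the stack lock only after releasing it.
     * Returns true on the first call (the layer was "removed"), false afterwards.
     */
    public static boolean complete() {
        boolean shouldRemove;
        synchronized (layerMonitor) {
            shouldRemove = !completed;
            completed = true;
        } // monitor released before the stack lock is acquired
        if (shouldRemove) {
            stackLock.readLock().lock(); // acquired with no monitor held: no inversion possible
            try {
                // ... remove this filter layer from the protocol stack ...
            } finally {
                stackLock.readLock().unlock();
            }
        }
        return shouldRemove;
    }

    public static void main(String[] args) {
        System.out.println("first call removed layer: " + complete());   // true
        System.out.println("second call removed layer: " + complete());  // false
    }
}
```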

          basil Basil Crow added a comment -

          Thank you! I appreciate this.


            People

            • Assignee:
              jthompson Jeff Thompson
              Reporter:
              basil Basil Crow
            • Votes:
              0
              Watchers:
              3
