Jenkins / JENKINS-45219

Remoting should terminate() channel after a timeout even if it does not hear from the remote side

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Component/s: remoting
    • Labels:
      None

      Description

      Currently, the channel termination logic depends on the two sides of the channel exchanging CloseCommands. Consider two sides, sideA and sideB:

      1) sideA requests the channel close

      2) CloseCommand goes to sideB

      3) Transport#commandReceiver() fails to invoke the task for any reason (deadlock, overload, thread death, etc.) and does not send the CloseCommand back

      4) channel.terminate(new OrderlyShutdown(createdAt)) does not get invoked on sideA

      5) If the channel is operational and there is no PingThread, channel.terminate() will never be invoked on sideA

      6) The channel on sideA never closes the Receiver, so Channel#inClosed stays null

      7) If there are pending Request#call() operations, they may hang indefinitely in this loop:

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
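
One way to keep such a loop from spinning forever is to track an overall deadline across the 30-second wakeups. The sketch below is a self-contained model, not the actual hudson.remoting.Request code: the class BoundedWait, its methods and the timeout parameter are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending request; NOT the real hudson.remoting.Request. */
final class BoundedWait {
    private Object response;   // set when a reply arrives
    private boolean inClosed;  // stands in for channel.isInClosed()

    synchronized void onResponse(Object r) { response = r; notifyAll(); }

    synchronized void onInClosed() { inClosed = true; notifyAll(); }

    /** Waits for a response, but gives up once timeoutMillis has elapsed in total. */
    synchronized Object await(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (response == null && !inClosed) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                // Neither a response nor a channel close was observed in time.
                throw new TimeoutException("No response within " + timeoutMillis + " ms");
            }
            // Same periodic wakeup as the original loop, capped by the time left.
            // The argument is always >= 1, so this never degrades to wait(0).
            wait(Math.min(remainingMillis, 30_000L));
        }
        if (response == null) {
            throw new IllegalStateException("Channel closed before a response arrived");
        }
        return response;
    }
}
```

With a deadline, a caller sees a TimeoutException instead of blocking indefinitely when the channel never reports that its input side is closed. Whether a timeout is acceptable for every caller is a separate design question that this sketch does not settle.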

       

      Setting a timeout for channel termination in close() may help: if sideB does not send its CloseCommand back within the timeout (e.g. 1 minute), sideA can forcefully terminate the channel.
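
The proposal could be sketched roughly as a watchdog armed by close(). This is a simplified model under stated assumptions, not the remoting implementation: TimedCloseChannel, onPeerCloseCommand() and the string causes are hypothetical, and a real channel would send its CloseCommand and call channel.terminate(...) where the comments indicate.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of one side of a channel; NOT the real hudson.remoting.Channel. */
final class TimedCloseChannel {
    private final ScheduledExecutorService timer;
    private ScheduledFuture<?> watchdog;
    private String terminationCause; // null while the channel is still open

    TimedCloseChannel(ScheduledExecutorService timer) {
        this.timer = timer;
    }

    /** Requests an orderly close and arms a watchdog in case the peer never answers. */
    synchronized void close(long timeout, TimeUnit unit) {
        if (watchdog != null || terminationCause != null) {
            return; // close already requested, or channel already terminated
        }
        // (a real channel would send its CloseCommand to the peer here)
        watchdog = timer.schedule(
                () -> terminate("forced: no CloseCommand from peer"), timeout, unit);
    }

    /** Called when the peer's CloseCommand arrives: the normal, orderly path. */
    synchronized void onPeerCloseCommand() {
        if (watchdog != null) {
            watchdog.cancel(false);
        }
        terminate("orderly shutdown");
    }

    /** Idempotent: only the first cause is recorded. */
    synchronized void terminate(String cause) {
        if (terminationCause == null) {
            terminationCause = cause;
            // (a real channel would close its receiver and fail pending calls here)
        }
    }

    synchronized String terminationCause() {
        return terminationCause;
    }
}
```

Because terminate() is idempotent in this model, a late CloseCommand that arrives after the watchdog has fired is harmless, and a watchdog that races with an orderly shutdown cannot overwrite the recorded cause.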

            Activity

            oleg_nenashev Oleg Nenashev created issue
            oleg_nenashev Oleg Nenashev made changes:
              Status: Open [ 1 ] -> In Progress [ 3 ]
            oleg_nenashev Oleg Nenashev made changes:
              Epic Link: set to JENKINS-38833 [ 175240 ]
            oleg_nenashev Oleg Nenashev added a comment -

            I think it's just JENKINS-44785 now. JENKINS-45294 and JENKINS-45023 should have improved the situation a lot, but we still need timeouts to break this wait() cycle.

            oleg_nenashev Oleg Nenashev made changes:
              Link: This issue duplicates JENKINS-44785 [ JENKINS-44785 ]
            cloudbees CloudBees Inc. made changes:
              Remote Link: This issue links to "CloudBees Internal OSS-1784 (Web Link)" [ 18574 ]
            oleg_nenashev Oleg Nenashev made changes:
              Status: In Progress [ 3 ] -> Resolved [ 5 ]
              Resolution: set to Duplicate [ 3 ]

              People

              • Assignee:
                Unassigned
              • Reporter:
                oleg_nenashev Oleg Nenashev
              • Votes:
                0
              • Watchers:
                3
