Jenkins / JENKINS-45219

Remoting should terminate() channel after a timeout even if it does not hear from the remote side

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Component/s: remoting
    • Labels:
      None
      Description

      Currently, the channel termination logic depends on an exchange of CloseCommands between the two sides of the channel. With the sides called sideA and sideB, the failure scenario is:

      1) sideA requests the channel close

      2) CloseCommand goes to sideB

      3) Transport#commandReceiver() fails to invoke the task for any reason (deadlock, overload, thread death, etc.) and does not send the CloseCommand back

      4) channel.terminate(new OrderlyShutdown(createdAt)) does not get invoked on sideA

      5) If the channel is operational and there is no PingThread, channel.terminate() will never be invoked on sideA

      6) channel on sideA never closes the Receiver, so Channel#inClosed stays null

      7) If there are pending Request#calls() operations, they may hang indefinitely in this loop:

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
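
For illustration, the loop above could be bounded by an overall deadline, so that a caller gets an exception instead of waiting forever. This is only a minimal sketch: BoundedResponseWait, onResponse(), onInClosed() and await() are hypothetical names, not part of the remoting API, and the inClosed flag merely stands in for channel.isInClosed().

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending request; NOT the real hudson.remoting.Request. */
final class BoundedResponseWait {
    private Object response;   // set when a reply arrives
    private boolean inClosed;  // stand-in for channel.isInClosed()

    /** Would be called by the receiver thread when the reply arrives. */
    synchronized void onResponse(Object r) {
        response = r;
        notifyAll();
    }

    /** Would be called when the inbound side of the channel is closed. */
    synchronized void onInClosed() {
        inClosed = true;
        notifyAll();
    }

    /**
     * Waits for a response, but gives up once timeoutMillis has elapsed.
     * Returns the response, or null if the inbound side closed first.
     */
    synchronized Object await(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (response == null && !inClosed) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("No response within " + timeoutMillis + " ms");
            }
            // Round up so that wait(0), which means "wait forever", is never called.
            long remainingMillis = (remainingNanos + 999_999L) / 1_000_000L;
            // Keep the 30-second polling slice of the original loop.
            wait(Math.min(remainingMillis, 30_000L));
        }
        return response;
    }
}
```

With this shape, a request whose reply never arrives fails with a TimeoutException after the configured period, even if the inbound side is never marked as closed.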

       

      Setting a timeout for channel termination in close() may help: if sideB does not send the CloseCommand back within that timeout (e.g. 1 minute), sideA forcefully terminates the channel.
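
A rough sketch of that idea, under assumptions: close() arms a timer, and whichever comes first (the peer's CloseCommand or the timer) terminates the channel. TimedCloseChannel and all of its methods are hypothetical and only model the behaviour; this is not the real hudson.remoting.Channel, and the termination cause is a plain string here rather than an OrderlyShutdown.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of sideA; NOT the real hudson.remoting.Channel. */
final class TimedCloseChannel {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "close-timeout");
                t.setDaemon(true); // do not keep the JVM alive for the timer
                return t;
            });
    private String terminationCause; // null while the channel is still open
    private boolean closeRequested;

    /** Requests an orderly close and arms a timer that forces termination. */
    synchronized void close(long timeout, TimeUnit unit) {
        if (terminationCause != null || closeRequested) {
            return;
        }
        closeRequested = true;
        // A real channel would send its CloseCommand to the peer here.
        timer.schedule(() -> terminate("forced after close timeout"), timeout, unit);
    }

    /** Would be called by the receiver thread when the peer's CloseCommand arrives. */
    synchronized void onCloseCommandReceived() {
        terminate("orderly shutdown");
    }

    /** Idempotent: only the first cause is recorded. */
    synchronized void terminate(String cause) {
        if (terminationCause != null) {
            return;
        }
        terminationCause = cause;
        timer.shutdownNow(); // drops the pending forced termination, if any
        notifyAll();         // wake threads blocked in awaitTermination()
    }

    synchronized String terminationCause() {
        return terminationCause;
    }

    /** Blocks until the channel is terminated or the given time has elapsed. */
    synchronized boolean awaitTermination(long millis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        while (terminationCause == null) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false;
            }
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}
```

In this model, calling close(1, TimeUnit.MINUTES) against a peer that never replies still ends with the channel terminated after one minute, while a timely CloseCommand terminates it immediately and discards the timer.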

            Activity

            oleg_nenashev Oleg Nenashev added a comment -

            I think it's just JENKINS-44785 now. JENKINS-45294 and JENKINS-45023 should have improved the situation a lot, but we still need timeouts to break this wait() cycle


              People

              • Assignee: Unassigned
              • Reporter: Oleg Nenashev (oleg_nenashev)
              • Votes: 0
              • Watchers: 3
