  1. Jenkins
  2. JENKINS-26947

Maven job stuck when slave channel gets disconnected

    Details

      Description

      I found a way to trigger a remoting problem using TCP fault injection with netem. I am able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
      

      When this wait is triggered, the running build gets stuck and keeps occupying an executor. The thread loops over and over on the wait.
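The loop above only exits when a response arrives or the channel reports its inbound side as closed, so if neither happens the 30-second timeout simply re-arms the wait. The sketch below is a hypothetical stand-in: StuckWaitDemo, PendingRequest and their methods are invented for illustration and are not the hudson.remoting classes. It reproduces the loop shape and contrasts it with a variant bounded by an overall deadline, which is only one possible mitigation and not necessarily the fix applied in remoting.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending remoting request; not the real hudson.remoting.Request. */
public class StuckWaitDemo {

    static class PendingRequest {
        private Object response;   // set when a reply arrives
        private boolean inClosed;  // stand-in for channel.isInClosed()

        /** Same loop shape as the snippet above: every timeout just re-enters wait(). */
        synchronized Object callUnbounded(long pollMillis) throws InterruptedException {
            while (response == null && !inClosed) {
                wait(pollMillis);
            }
            return response;
        }

        /** Assumed mitigation: give up once an overall deadline has passed. */
        synchronized Object callWithDeadline(long pollMillis, long deadlineMillis)
                throws InterruptedException, TimeoutException {
            long end = System.nanoTime() + deadlineMillis * 1_000_000L;
            while (response == null && !inClosed) {
                long remainingMillis = (end - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw new TimeoutException("no response within " + deadlineMillis + " ms");
                }
                // remainingMillis is at least 1 here, so this never becomes wait(0)
                wait(Math.min(pollMillis, remainingMillis));
            }
            return response;
        }

        synchronized void onResponse(Object value) { response = value; notifyAll(); }
        synchronized void onChannelClosed() { inClosed = true; notifyAll(); }
    }

    /** Starts an unbounded call that nobody answers and reports whether it is still blocked. */
    static boolean callerStillBlocked(long pollMillis, long observeMillis) {
        PendingRequest request = new PendingRequest();
        Thread caller = new Thread(() -> {
            try {
                request.callUnbounded(pollMillis);
            } catch (InterruptedException e) {
                // interrupted by the observer below; just exit
            }
        });
        caller.setDaemon(true);
        caller.start();
        try {
            caller.join(observeMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean blocked = caller.isAlive();
        caller.interrupt(); // release the demo thread
        return blocked;
    }

    /** Runs the bounded variant with no response and reports whether it timed out. */
    static boolean boundedCallTimesOut(long pollMillis, long deadlineMillis) {
        try {
            new PendingRequest().callWithDeadline(pollMillis, deadlineMillis);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("unbounded caller still blocked: " + callerStillBlocked(50, 300));
        System.out.println("bounded caller timed out: " + boundedCallTimesOut(50, 300));
    }
}
```

In this sketch, calling onResponse or onChannelClosed is the only way out of callUnbounded. That matches the symptom described here: a disconnected channel that never flips the closed flag leaves the caller waiting indefinitely.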

      To reproduce, set up an SSH slave using the attached Dockerfile, then set up netem on the docker0 bridge like this:

      tc qdisc add dev docker0 root netem
      tc qdisc change dev docker0 root netem corrupt 1
      

      Testing requires running the job once before configuring netem: netem settings apply to all network streams, so the build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...

      To remove netem settings, just run tc qdisc del dev docker0 root.
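When scripting the reproduction, the order matters: add the netem qdisc, enable corruption, trigger the build, and always remove the qdisc afterwards. The following is a minimal sketch of that sequence. NetemFaultInjection and its methods are hypothetical helper names, the tc arguments are copied verbatim from the commands above, and actually running it assumes root privileges and tc on the PATH.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper that wraps the tc commands from this report. */
public class NetemFaultInjection {

    /** The two commands that install netem and turn on packet corruption. */
    static List<List<String>> enableCommands(String dev, String corrupt) {
        return List.of(
            List.of("tc", "qdisc", "add", "dev", dev, "root", "netem"),
            List.of("tc", "qdisc", "change", "dev", dev, "root", "netem", "corrupt", corrupt));
    }

    /** The command that removes the netem settings again. */
    static List<String> disableCommand(String dev) {
        return List.of("tc", "qdisc", "del", "dev", dev, "root");
    }

    /** Runs one command and fails loudly on a non-zero exit code. */
    static void run(List<String> command) throws IOException, InterruptedException {
        int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
        if (exit != 0) {
            throw new IOException(String.join(" ", command) + " exited with " + exit);
        }
    }

    /** Enables corruption, runs the scenario, and removes netem even if the scenario fails. */
    static void withCorruption(String dev, String corrupt, Runnable scenario)
            throws IOException, InterruptedException {
        List<List<String>> enable = enableCommands(dev, corrupt);
        run(enable.get(0));              // tc qdisc add ...
        try {
            run(enable.get(1));          // tc qdisc change ... corrupt <value>
            scenario.run();
        } finally {
            run(disableCommand(dev));    // tc qdisc del ...
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder scenario: in practice, trigger the Maven build here.
        withCorruption("docker0", "1", () -> System.out.println("trigger the Maven build now"));
    }
}
```

The try/finally is there because a leftover netem qdisc keeps corrupting every stream on the bridge, including later dependency downloads.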

      I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.

        Attachments

        1. Dockerfile
          0.4 kB
        2. launch.sh
          0.0 kB
        3. stacktrace.txt
          44 kB

            Activity

            ydubreuil Yoann Dubreuil created issue -
            ydubreuil Yoann Dubreuil made changes -
            Field Original Value New Value
            Description (text unchanged; the Java snippet and the tc commands were wrapped in {code} markup)
            ydubreuil Yoann Dubreuil made changes -
            Summary Unattended wait in the remoting code Maven job stuck when slave channel get disconnected
            Component/s maven-plugin [ 16033 ]
            Component/s remoting [ 15489 ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "PR 39 (Web Link)" [ 12176 ]
            jglick Jesse Glick made changes -
            Assignee Yoann Dubreuil [ ydubreuil ]
            jglick Jesse Glick made changes -
            Labels remoting robustness slave
            jglick Jesse Glick made changes -
            Link This issue is related to JENKINS-10840 [ JENKINS-10840 ]
            scm_issue_link SCM/JIRA link daemon made changes -
            Status Open [ 1 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            jglick Jesse Glick made changes -
            Link This issue depends on JENKINS-22252 [ JENKINS-22252 ]
            jglick Jesse Glick made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 161132 ] JNJira + In-Review [ 186267 ]
            oleg_nenashev Oleg Nenashev made changes -
            Assignee Yoann Dubreuil [ ydubreuil ] Oleg Nenashev [ oleg_nenashev ]
            oleg_nenashev Oleg Nenashev made changes -
            Component/s remoting [ 15489 ]
            oleg_nenashev Oleg Nenashev made changes -
            Assignee Oleg Nenashev [ oleg_nenashev ]

              People

              • Assignee:
                Unassigned
                Reporter:
                ydubreuil Yoann Dubreuil
              • Votes:
                0
                Watchers:
                6
