- Type: Bug
- Status: Reopened
- Priority: Minor
- Resolution: Unresolved
- Component/s: maven-plugin, remoting
- Labels:
- Environment: Linux
I found a way to trigger a remoting problem using TCP fault injection with netem. I can reach this wait call at hudson.remoting.Request.call(Request.java:146):
    while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
When this wait is reached, the running build gets stuck and occupies an executor. The thread loops over and over on the wait.
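To illustrate why the build never finishes: the quoted loop only exits when a response arrives or the channel is reported closed, so if neither happens it wakes every 30 seconds and goes straight back to waiting. The sketch below reproduces that pattern in isolation and adds an overall deadline as one possible defensive measure. It is not the remoting code and not the fix from any linked pull request; `PendingCall`, `onResponse`, `onChannelClosed`, and `await` are hypothetical names chosen for this example.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending remote call; NOT hudson.remoting.Request. */
class PendingCall {
    private Object response;        // set when a reply arrives
    private boolean channelClosed;  // stand-in for channel.isInClosed()

    synchronized void onResponse(Object r) {
        response = r;
        notifyAll();
    }

    synchronized void onChannelClosed() {
        channelClosed = true;
        notifyAll();
    }

    /**
     * Same loop shape as the quoted code, but bounded: it gives up once
     * overallTimeoutMillis has elapsed instead of polling forever.
     */
    synchronized Object await(long pollMillis, long overallTimeoutMillis)
            throws InterruptedException, TimeoutException {
        if (pollMillis <= 0) {
            // wait(0) would mean "wait forever", which is what we want to avoid
            throw new IllegalArgumentException("pollMillis must be positive");
        }
        long deadline = System.nanoTime() + overallTimeoutMillis * 1_000_000L;
        while (response == null && !channelClosed) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                throw new TimeoutException("no response and channel not reported closed");
            }
            wait(Math.min(pollMillis, remainingMillis));
        }
        if (response == null) {
            throw new IllegalStateException("channel closed before a response arrived");
        }
        return response;
    }
}
```

With a deadline, a caller whose peer silently disappears gets a `TimeoutException` and can release the executor, whereas the unbounded loop keeps the thread parked. Whether a hard deadline is acceptable for long-running remote calls is a separate design question that this sketch does not settle.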
To reproduce, set up an SSH slave using the attached Dockerfile, then set up netem on the docker0 bridge like this:
    tc qdisc add dev docker0 root netem
    tc qdisc change dev docker0 root netem corrupt 1
Run the job once before configuring netem: netem settings apply to all network streams, so a first build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...
To remove netem settings, just run tc qdisc del dev docker0 root.
I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.
- depends on: JENKINS-22252 Maven 3.2.1: IllegalAccessError on AbstractMapBasedMultimap (Closed)
- is related to: JENKINS-10840 Maven "module" shows as running after build is aborted. (Open)
- links to
Field | Original Value | New Value |
---|---|---|
Description | Same text as the description above, with the Java snippet wrapped in {{ }} | Same text, with the Java snippet and the tc commands each wrapped in {code} ... {code} |
Summary | Unattended wait in the remoting code | Maven job stuck when slave channel get disconnected |
Component/s | maven-plugin [ 16033 ] | |
Component/s | remoting [ 15489 ] |
Remote Link | This issue links to "PR 39 (Web Link)" [ 12176 ] |
Assignee | Yoann Dubreuil [ ydubreuil ] |
Labels | remoting robustness slave |
Link | This issue is related to JENKINS-10840 [ JENKINS-10840 ] |
Status | Open [ 1 ] | Resolved [ 5 ] |
Resolution | Fixed [ 1 ] |
Link | This issue depends on |
Resolution | Fixed [ 1 ] | |
Status | Resolved [ 5 ] | Reopened [ 4 ] |
Workflow | JNJira [ 161132 ] | JNJira + In-Review [ 186267 ] |
Assignee | Yoann Dubreuil [ ydubreuil ] | Oleg Nenashev [ oleg_nenashev ] |
Component/s | remoting [ 15489 ] |
Assignee | Oleg Nenashev [ oleg_nenashev ] |
Is this a security issue? For example, could third parties exploit this to disrupt a network-reachable Jenkins service?