Today we encountered a situation where one of our slaves was completely locked up.
- Jobs would launch but get no further than
    Building remotely on XXX in workspace YYY
    Starting build job ZZZ
- No apparent problematic entries in the master log
- Status showed the slave as online
- No apparent problematic entries in the slave log; entries simply stopped at the time the problem started
Taking a stack trace showed that all threads were stuck in the following stack frame (full stack trace attached):
"pool-1-thread-10786" prio=3 tid=0x08461800 nid=0x4e43 in Object.wait() [0xb5088000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xbade43b0> (a hudson.remoting.PipeWindow$Real)
    at java.lang.Object.wait(Object.java:485)
    at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:177)
    - locked <0xbade43b0> (a hudson.remoting.PipeWindow$Real)
    at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:118)
    - locked <0xbade43d8> (a hudson.remoting.ProxyOutputStream)
    at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
    at hudson.Util.copyStream(Util.java:454)
    at hudson.FilePath$28.call(FilePath.java:1623)
    at hudson.FilePath$28.call(FilePath.java:1617)
    at hudson.remoting.UserRequest.perform(UserRequest.java:118)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at hudson.remoting.Engine$1$1.run(Engine.java:60)
    at java.lang.Thread.run(Unknown Source)
Looking at the code of PipeWindow$Real.get(), it does not seem impossible for threads to get stuck in get() and never be woken up if the pipe fills up. However, I can't point to a concrete problem.
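To illustrate the kind of stall I have in mind, here is a minimal sketch of a wait/notify flow-control window. This is NOT the actual hudson.remoting code: the class PipeWindowSketch, its Window class, and the writeChunks helper are hypothetical names made up for illustration, and I added a timeout to get() only so the demo terminates. The assumption is that a writer blocks in get() until the reader acknowledges consumed bytes via increase(); if that acknowledgement never arrives, the writer would wait forever without the timeout.

```java
public class PipeWindowSketch {
    /** Hypothetical, simplified flow-control window (not the real PipeWindow). */
    static final class Window {
        private int available;

        Window(int initial) {
            available = initial;
        }

        /**
         * Waits until some window is available and returns its size.
         * Returns 0 if the timeout elapses or the thread is interrupted.
         * Without the timeout, a missing increase() would block forever.
         */
        synchronized int get(long timeoutMillis) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (available <= 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return 0;
                }
                try {
                    wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return 0;
                }
            }
            return available;
        }

        /** Called by the writer after sending n bytes. */
        synchronized void decrease(int n) {
            available -= n;
        }

        /** Called when the reader acknowledges n bytes; wakes up waiting writers. */
        synchronized void increase(int n) {
            available += n;
            notifyAll();
        }
    }

    /** Tries to write the given number of one-byte chunks; returns how many went through. */
    static int writeChunks(Window window, int chunks, long timeoutMillis) {
        int written = 0;
        for (int i = 0; i < chunks; i++) {
            if (window.get(timeoutMillis) == 0) {
                break; // window exhausted and no acknowledgement arrived in time
            }
            window.decrease(1);
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        Window window = new Window(3);
        // No reader ever calls increase(), so the writer stalls once the window is used up.
        System.out.println("written before stall: " + writeChunks(window, 5, 200));
        // Simulate a late acknowledgement; the writer can make progress again.
        window.increase(2);
        System.out.println("written after ack: " + writeChunks(window, 2, 200));
    }
}
```

In this sketch the writer only makes progress again once increase() is called. Whether something comparable can actually happen in the real remoting code is exactly what I am unsure about.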
I checked the existing issues and found JENKINS-9540 and JENKINS-22807, but those seem different: they involve particular messages in the logs, which we do not see here.
Could this be a deadlock in the slave remoting code?