Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Component/s: remoting, ssh-slaves-plugin, subversion-plugin
Labels:
- core
- slave
- ssh-slave
- svn
Environment:
CentOS 6.7 on Master and Slave
Jenkins v1.639
SSH Slaves plugin v1.10
java-1.7.0-openjdk-1.7.0.91-2.6.2.2.el6_7.x86_64

Hi,

I am seeing a reproducible issue where my Jenkins execution slave crashes if two jobs try to check out the same file from SVN at the same time (each into its own, individual workspace).

My setup:
I have an environment with a Jenkins Master and single Jenkins Slave. The Master has 0 executors and the Slave has 50 executors. We use SVN for our SCM and have ~100 jobs.

Our 100 jobs are divided by product. Each product has about 10 jobs that are nearly identical, but each one builds from a different build folder within the checkout to produce a different configuration of the product. This means that the same repository (and all of its externals) is checked out 10 times on a commit to build 10 different configurations of very similar code.

Recently we have been busy and have seen a lot of commits, each one launching 10 nearly identical jobs, which is exactly what we want. The issue is that (as far as I can tell) when two of the jobs attempt to check out the same file at the same time, the slave crashes with a message like this:

FATAL: java.io.IOException: Unexpected termination of the channel
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.Request.abort(Request.java:297)
	at hudson.remoting.Channel.terminate(Channel.java:847)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
	at ......remote call to Umbreon(Native Method)
	at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
	at hudson.remoting.Request.call(Request.java:172)
	at hudson.remoting.Channel.call(Channel.java:780)
	at hudson.FilePath.act(FilePath.java:979)
	at hudson.FilePath.act(FilePath.java:968)
	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:848)
	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:786)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2332)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2801)
	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

Case 1:
I set up a test environment with a master and one slave, the slave having 10 executors. I created a test job that just checks out a large repository and then exits. Then I rapidly launched the same job 10 times. The idea is that each job gets its own workspace, checks out the repository and exits. Most of the time (but not always), the slave process is killed with a SIGABRT and the master then restarts the slave. I could see that at least two jobs were trying to check out the same file at the same time.
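For anyone trying to reproduce Case 1, the "rapidly launch the same job 10 times" step could be scripted roughly as below. This is only a sketch, not what I actually ran: it assumes the test job is parameterized and allows concurrent builds (the `RUN_ID` parameter name is hypothetical, and only there so that Jenkins does not merge identical queued builds), and it assumes the master accepts anonymous POST requests with no CSRF crumb. The function names are made up for illustration.

```python
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_trigger_urls(base_url, job_name, count, param="RUN_ID"):
    """Return one buildWithParameters URL per requested build.

    Each URL carries a distinct value for `param` (a hypothetical job
    parameter) so that the queued builds are not identical.
    """
    root = base_url.rstrip("/")
    job_path = urllib.parse.quote(job_name, safe="")
    return [
        f"{root}/job/{job_path}/buildWithParameters?"
        + urllib.parse.urlencode({param: str(i)})
        for i in range(count)
    ]


def http_post(url):
    """Send an empty POST and return the HTTP status code.

    Assumes no authentication and no CSRF crumb are required.
    """
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def trigger_builds(urls, post=http_post):
    """Fire all trigger requests at roughly the same time, one thread each."""
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(post, urls))
```

Usage would be something like `trigger_builds(build_trigger_urls("http://jenkins.example.com:8080", "svn-checkout-test", 10))`, where both the URL and the job name are placeholders. The `post` argument can be swapped for a function that adds credentials if the master requires them.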

Case 2:
I set up another test environment with a master and 10 slaves, each slave having 1 executor. I created a test job that just checks out a large repository and then exits. Then I rapidly launched the same job 10 times. The idea is that each job gets its own executor and its own workspace, checks out the repository and exits. This worked every time I tried it, and I saw no failures.

Case 3:
I set up another test environment with a master and 10 slaves, each slave having 10 executors. I created a test job that just checks out a large repository and then exits. Then I rapidly launched the same job 100 times. The idea is that each slave gets 10 jobs, and each job gets its own workspace, checks out the repository and exits. This was a useful test case because only 5 of the 10 slaves failed and had their slave.jar restarted; the rest were OK. On the slaves that failed, I could see that at least two jobs were trying to check out the same file at the same time.
