Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-26947

Maven job stuck when slave channel get disconnected

    XMLWordPrintable

    Details

    • Similar Issues:

      Description

      I find a way to trigger a remoting problem using tcp fault injection with netem. I'm able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
      

      When this wait is triggered, the running build is stuck and consumes a executor. It loops over and over on the wait.

      To reproduce, setup a SSH slave using the attached Dockerfile, and setup netem on the docker0 bridge like this:

      tc qdisc add dev docker0 root netem
      tc qdisc change dev docker0 root netem corrupt 1
      

      Testing requires to run the job one time before configuring netem, as netem settings are applied to all network streams, it could fail while downloading Maven dependencies. I just launched a Maven build of a example project to trigger the problem. It might be a Maven specific problem...

      To remove netem settings, just run tc qdisc del dev docker0 root.

      I've attached the Dockerfile, the command I used to launch it and a threaddump of a Jenkins stuck master.

        Attachments

        1. Dockerfile
          0.4 kB
        2. launch.sh
          0.0 kB
        3. stacktrace.txt
          44 kB

          Issue Links

            Activity

            Hide
            danielbeck Daniel Beck added a comment -

            Is this a security issue? E.g. is this exploitable by third parties to disrupt network reachable Jenkins service?

            Show
            danielbeck Daniel Beck added a comment - Is this a security issue? E.g. is this exploitable by third parties to disrupt network reachable Jenkins service?
            Hide
            ydubreuil Yoann Dubreuil added a comment -

            You must be on the path of the network stream to be able to change the packet content. Even if you are able to get there, the SSH protocol protects the content of the stream. You would only be able to trigger a disconnection, but at this stage, I bet a lot of other network services are in danger.

            So for me, it's not a security issue.

            Show
            ydubreuil Yoann Dubreuil added a comment - You must be on the path of the network stream to be able to change the packet content. Even if you are able to get there, the SSH protocol protects the content of the stream. You would only be able to trigger a disconnection, but at this stage, I bet a lot of other network services are in danger. So for me, it's not a security issue.
            Hide
            teilo James Nord added a comment -

            Have you disabled the PIngThread at all?

            AFAICT netem does not kill the connection - the remote end will be retransmitting the packets - and as such the channel is not closed.
            The PingThread should eventually notice this (10 minutes interval + 4 minute timeout) so after at most 14 minutes the connection should be killed and this thread unblock.

            Show
            teilo James Nord added a comment - Have you disabled the PIngThread at all? AFAICT netem does not kill the connection - the remote end will be retransmitting the packets - and as such the channel is not closed. The PingThread should eventually notice this (10 minutes interval + 4 minute timeout) so after at most 14 minutes the connection should be killed and this thread unblock.
            Hide
            ydubreuil Yoann Dubreuil added a comment -

            No, I did not disable the ping thread. In fact, I did nothing special, just started a fresh Jenkins instance and connected it to this docker slave. I took the thread dump 30 minutes after the disconnection. Will relaunch the test this afternoon to see if it would ever times out or not.

            Show
            ydubreuil Yoann Dubreuil added a comment - No, I did not disable the ping thread. In fact, I did nothing special, just started a fresh Jenkins instance and connected it to this docker slave. I took the thread dump 30 minutes after the disconnection. Will relaunch the test this afternoon to see if it would ever times out or not.
            Hide
            ydubreuil Yoann Dubreuil added a comment -

            The executor thread is still blocked 2 hours after slave gets disconnected.

            Stack of the blocked thread is:

            "Executor #0 for dumb-docker-ssh : executing gameoflife #41 / waiting for hudson.remoting.Channel@7b736159:Channel to Maven [java, -cp, /home/jenkins/bis/maven31-agent.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/boot/plexus-classworlds-2.5.1.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/conf/logging, jenkins.maven3.agent.Maven31Main, /home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2, /home/jenkins/bis/slave.jar, /home/jenkins/bis/maven31-interceptor.jar, /home/jenkins/bis/maven3-interceptor-commons.jar, 42104]" daemon prio=10 tid=0x00007f11f0187800 nid=0x6d4f in Object.wait() [0x00007f1245eb2000]
               java.lang.Thread.State: TIMED_WAITING (on object monitor)
            	at java.lang.Object.wait(Native Method)
            	- waiting on <0x00000000879416d8> (a hudson.remoting.UserRequest)
            	at hudson.remoting.Request.call(Request.java:146)
            	- locked <0x00000000879416d8> (a hudson.remoting.UserRequest)
            	at hudson.remoting.Channel.call(Channel.java:751)
            	at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161)
            	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
            	at hudson.model.Run.execute(Run.java:1745)
            	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529)
            	at hudson.model.ResourceController.execute(ResourceController.java:89)
            	at hudson.model.Executor.run(Executor.java:240)
            
            Show
            ydubreuil Yoann Dubreuil added a comment - The executor thread is still blocked 2 hours after slave gets disconnected. Stack of the blocked thread is: "Executor #0 for dumb-docker-ssh : executing gameoflife #41 / waiting for hudson.remoting.Channel@7b736159:Channel to Maven [java, -cp, /home/jenkins/bis/maven31-agent.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/boot/plexus-classworlds-2.5.1.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/conf/logging, jenkins.maven3.agent.Maven31Main, /home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2, /home/jenkins/bis/slave.jar, /home/jenkins/bis/maven31-interceptor.jar, /home/jenkins/bis/maven3-interceptor-commons.jar, 42104]" daemon prio=10 tid=0x00007f11f0187800 nid=0x6d4f in Object .wait() [0x00007f1245eb2000] java.lang. Thread .State: TIMED_WAITING (on object monitor) at java.lang. Object .wait(Native Method) - waiting on <0x00000000879416d8> (a hudson.remoting.UserRequest) at hudson.remoting.Request.call(Request.java:146) - locked <0x00000000879416d8> (a hudson.remoting.UserRequest) at hudson.remoting.Channel.call(Channel.java:751) at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161) at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533) at hudson.model.Run.execute(Run.java:1745) at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
            Hide
            teilo James Nord added a comment - - edited

            possibly a duplicate of JENKINS-10840

            Soemthing strange is going on with Docker and tc.

            with 2 freestyle builds I see a failure and the salve is disconnected with.

            java.io.IOException: remote file operation failed: /home/jenkins/data/jenkins-slave.exe at hudson.remoting.Channel@7407d0f5:docker_ssh: hudson.remoting.ChannelClosedException: channel is already closed
            	at hudson.FilePath.act(FilePath.java:985)
            	at hudson.FilePath.act(FilePath.java:967)
            	at hudson.FilePath.exists(FilePath.java:1435)
            	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:46)
            	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:37)
            	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            	at java.lang.Thread.run(Thread.java:745)
            Caused by: hudson.remoting.ChannelClosedException: channel is already closed
            	at hudson.remoting.Channel.send(Channel.java:549)
            	at hudson.remoting.Request.call(Request.java:129)
            	at hudson.remoting.Channel.call(Channel.java:751)
            	at hudson.FilePath.act(FilePath.java:978)
            	... 9 more
            Caused by: java.io.IOException: Unexpected termination of the channel
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
            Caused by: java.io.EOFException
            	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
            	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
            	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
            	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
            	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
            	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
            ERROR: Socket connection to SSH server was lost
            java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt
            	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
            	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
            	at java.lang.Thread.run(Thread.java:745)
            

            But a single bit packet corruption should cause the packet to be thrown away by the OS layer due to a TCP checksum miss-match and not to be seen by the application.

            The other interesting thing is that a build can be runnign fine and it only dies when a new build is kicked off - I would not expect an issue in setting up a new channel in the multiplex (from what KK said) to fail the other channels.

            Show
            teilo James Nord added a comment - - edited possibly a duplicate of JENKINS-10840 Soemthing strange is going on with Docker and tc. with 2 freestyle builds I see a failure and the salve is disconnected with. java.io.IOException: remote file operation failed: /home/jenkins/data/jenkins-slave.exe at hudson.remoting.Channel@7407d0f5:docker_ssh: hudson.remoting.ChannelClosedException: channel is already closed at hudson.FilePath.act(FilePath.java:985) at hudson.FilePath.act(FilePath.java:967) at hudson.FilePath.exists(FilePath.java:1435) at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:46) at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:37) at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: hudson.remoting.ChannelClosedException: channel is already closed at hudson.remoting.Channel.send(Channel.java:549) at hudson.remoting.Request.call(Request.java:129) at hudson.remoting.Channel.call(Channel.java:751) at hudson.FilePath.act(FilePath.java:978) ... 9 more Caused by: java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50) Caused by: java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) ERROR: Socket connection to SSH server was lost java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766) at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489) at java.lang.Thread.run(Thread.java:745) But a single bit packet corruption should cause the packet to be thrown away by the OS layer due to a TCP checksum miss-match and not to be seen by the application. The other interesting thing is that a build can be runnign fine and it only dies when a new build is kicked off - I would not expect an issue in setting up a new channel in the multiplex (from what KK said) to fail the other channels.
            Hide
            teilo James Nord added a comment -

            FWIW the original report has nothing to do with packet corruption - just the channel dying.

            You can get the same results with a "kill -9" on the slave.

            Show
            teilo James Nord added a comment - FWIW the original report has nothing to do with packet corruption - just the channel dying. You can get the same results with a "kill -9" on the slave.
            Hide
            ydubreuil Yoann Dubreuil added a comment -

            Yes that's right. I found the problem when playing with netem, hence the bug report.

            It's a bug in the Maven plugin. When upstream channel is closed, Maven channel stays around. Will post a PR shortly.

            Show
            ydubreuil Yoann Dubreuil added a comment - Yes that's right. I found the problem when playing with netem, hence the bug report. It's a bug in the Maven plugin. When upstream channel is closed, Maven channel stays around. Will post a PR shortly.
            Hide
            ydubreuil Yoann Dubreuil added a comment -
            Show
            ydubreuil Yoann Dubreuil added a comment - Just created a PR: https://github.com/jenkinsci/maven-plugin/pull/39
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Yoann Dubreuil
            Path:
            src/main/java/hudson/maven/AbstractMavenProcessFactory.java
            http://jenkins-ci.org/commit/maven-plugin/47b28737803ef90bc7e6518e159df28471988bae
            Log:
            [FIX JENKINS-26947] forcibly terminate Maven remoting channel when upstream channel get closed

            Currently, when the main remoting channel is abruptly closed, Maven channel can be stuck for a while because it doesn't get notified of the disconnection

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Yoann Dubreuil Path: src/main/java/hudson/maven/AbstractMavenProcessFactory.java http://jenkins-ci.org/commit/maven-plugin/47b28737803ef90bc7e6518e159df28471988bae Log: [FIX JENKINS-26947] forcibly terminate Maven remoting channel when upstream channel get closed Currently, when the main remoting channel is abruptly closed, Maven channel can be stuck for a while because it doesn't get notified of the disconnection
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Nicolas De loof
            Path:
            src/main/java/hudson/maven/AbstractMavenProcessFactory.java
            http://jenkins-ci.org/commit/maven-plugin/7171b20fd0d6b783c1f9e7b0ad09e031db47b4ac
            Log:
            Merge pull request #39 from ydubreuil/fix-JENKINS-26947

            JENKINS-26947 forcibly terminate Maven remoting channel when upstream channel get closed

            Compare: https://github.com/jenkinsci/maven-plugin/compare/ea1a77405c6e...7171b20fd0d6

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Nicolas De loof Path: src/main/java/hudson/maven/AbstractMavenProcessFactory.java http://jenkins-ci.org/commit/maven-plugin/7171b20fd0d6b783c1f9e7b0ad09e031db47b4ac Log: Merge pull request #39 from ydubreuil/fix- JENKINS-26947 JENKINS-26947 forcibly terminate Maven remoting channel when upstream channel get closed Compare: https://github.com/jenkinsci/maven-plugin/compare/ea1a77405c6e...7171b20fd0d6
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/hudson/maven/AbstractMavenProcessFactory.java
            http://jenkins-ci.org/commit/maven-plugin/51b204897c19c9be3b03a0a80482d096ae1db3e3
            Log:
            Revert "[FIX JENKINS-26947] forcibly terminate Maven remoting channel when upstream channel get closed"
            [FIXED JENKINS-22252] Caused a regression.
            This reverts commit 47b28737803ef90bc7e6518e159df28471988bae.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/hudson/maven/AbstractMavenProcessFactory.java http://jenkins-ci.org/commit/maven-plugin/51b204897c19c9be3b03a0a80482d096ae1db3e3 Log: Revert " [FIX JENKINS-26947] forcibly terminate Maven remoting channel when upstream channel get closed" [FIXED JENKINS-22252] Caused a regression. This reverts commit 47b28737803ef90bc7e6518e159df28471988bae.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/hudson/maven/AbstractMavenProcessFactory.java
            http://jenkins-ci.org/commit/maven-plugin/e85502ac9b4e05f8ff8a476c795057a770cdc97f
            Log:
            Merge pull request #52 from jglick/IllegalAccessError-JENKINS-22252

            JENKINS-22252 Revert JENKINS-26947 fix

            Compare: https://github.com/jenkinsci/maven-plugin/compare/22dcb14846e2...e85502ac9b4e

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/hudson/maven/AbstractMavenProcessFactory.java http://jenkins-ci.org/commit/maven-plugin/e85502ac9b4e05f8ff8a476c795057a770cdc97f Log: Merge pull request #52 from jglick/IllegalAccessError- JENKINS-22252 JENKINS-22252 Revert JENKINS-26947 fix Compare: https://github.com/jenkinsci/maven-plugin/compare/22dcb14846e2...e85502ac9b4e
            Hide
            jglick Jesse Glick added a comment -

            Reverting fix.

            Show
            jglick Jesse Glick added a comment - Reverting fix.
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            It is still applicable AFAICT though some related fixes have been applied on the Remoting side

            Show
            oleg_nenashev Oleg Nenashev added a comment - It is still applicable AFAICT though some related fixes have been applied on the Remoting side
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            No plan to work on it anytime soon

            Show
            oleg_nenashev Oleg Nenashev added a comment - No plan to work on it anytime soon

              People

              • Assignee:
                Unassigned
                Reporter:
                ydubreuil Yoann Dubreuil
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated: