Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: core, maven-plugin
    • Labels:
      None
    • Environment:
      core 1.564-SNAPSHOT, remoting 2.41
      Description

      On a number of the slaves at builds.apache.org, we're seeing slaves hanging after a while, both Linux and Windows slaves. The common thread seems to be Maven jobs being run on them and eventually hanging, causing everything else on the slave to hang (including, in some cases, attempts to get the threaddump from within Jenkins). The original Maven build hangs indefinitely, and any subsequent builds trying to run on the same slave get to the point of starting the git clone/svn checkout/etc and then just hang. The Linux slaves are running Java 1.8.0_05, and the Windows slaves are running some Java 7 version - not sure which.

      Threaddump for Linux is at https://gist.github.com/abayer/3d567b56776e1ce78ad7 (one job hanging for over a day, another that started an hour or so ago but is now hanging), threaddump for Windows is at https://gist.github.com/abayer/c99f72ca1232e4d8acfa (only one job running at all on there, hanging for 17 hours or so).

            Activity

            abayer Andrew Bayer added a comment -

            Kohsuke Kawaguchi, Jesse Glick - any ideas? I don't know where to start. For what it's worth, the forked off Maven process for the hung job is still running in these cases, but not doing anything...

            jglick Jesse Glick added a comment -

            Do not see any clues there. The master thread dump might also be relevant. Better to install the Support Core plugin and attach a diagnostic bundle that would have everything.

            jglick Jesse Glick added a comment -

            (And consider using freestyle projects, which are much less trouble-prone.)

            abayer Andrew Bayer added a comment -

            Yeah, I'd love to get off the Maven projects, but, well, there's 600 or so of them (out of 1150 or so jobs) and they're pretty well entrenched. If we can't resolve this, I'll try to start the ball rolling on a complete rebuild of the Apache Jenkins setup with the Maven plugin explicitly removed, but that'll be a giant pain in the ass given the fact that we're talking about a massive number of separate ASF projects each with their own teams, etc, etc...yeah.

            Installing Support Core now, and full thread dump up at https://gist.github.com/abayer/7ff4de807c6373eec40d.

            Might be worth mentioning that we see absolutely no hangs like this on the hadoopX slaves, which only run freestyle jobs, so far as I can tell, so it definitely looks like a problem in the Maven plugin...

            abayer Andrew Bayer added a comment -

            ...and fwiw, in the new version of my Jenkins best practices talk, I harp quite a bit on how you should never use the Maven plugin because it's a morass of pain. =)

            abayer Andrew Bayer added a comment -

            And also fwiw, the support core plugin doesn't actually seem to give me a real bundle. I'm guessing because the whole master is so borked. =)

            jglick Jesse Glick added a comment -

            Handling GET /job/Mahout-Quality/ws/trunk/examples/target/site/apidocs/index.html sounds bad. Is someone seriously trying to load a generated site from the workspace? Avoid (remote) workspace browsing whenever possible.

            jglick Jesse Glick added a comment -

            And Handling GET /job/river-qa-refactor-j9/ws/trunk/qa/result/*zip*/result.zip is even worse. Teach people to archive artifacts, then start disabling workspace browse permission. You are getting DoS’d I think.

            abayer Andrew Bayer added a comment -

            Yeah, quite aware of that from another JIRA I opened. I've turned off anonymous workspace read access and am trying to get people to stop linking to workspaces in general, but again, at ASF it's hard to get everyone to even notice the emails I send them about what they should stop doing, let alone actually stop doing it. Fun!

            abayer Andrew Bayer added a comment -

            Just as an experiment, I'm disabling workspace read for everyone but admins, so we'll see how that goes.

            abayer Andrew Bayer added a comment -

            Ok, got the support bundle to generate properly using the CLI. I'm going to give it a day or so post-restart with workspace read off, see if we have hangs, and if so, get a bundle here.

            abayer Andrew Bayer added a comment -

            So we've downgraded from 1.564-SNAPSHOT to 1.554.1 and that seems to have solved the problem - makes me guess that the problem is somewhere in the remoting changes between 1.554 and 1.564.

            jglick Jesse Glick added a comment -

            Did you pick up the JENKINS-22734 fix in 1.563? Running a snapshot build is not wise unless you are really prepared to review ongoing commits.

            abayer Andrew Bayer added a comment -

            Don't think we had - I want to get us off SNAPSHOTs, period, so yeah. That said, the symptoms described in that JIRA don't seem to match the ones we were seeing - the slaves were still "connected", just hung.

            abayer Andrew Bayer added a comment -

            Got another hang now on 1.554.1 - the Maven interceptor running on the slave is hung eating 99% of CPU for hours. Its thread dump is at https://gist.github.com/abayer/bc554112335fe229ddfe.

            jglick Jesse Glick added a comment -

            That thread dump looks idle to me. Not sure what you are hitting.

            abayer Andrew Bayer added a comment -

            Very weird. It was idling at 99% CPU for 3 hours after the log said Maven was done, so...weird.

            tbridges Tony Bridges added a comment -

            This looks very similar to what I am seeing on Windows master/slave running 1.554.3 with maven plugin 2.4. I'm also seeing a particular maven job (not all) consistently hanging up after metadata collection.

            tbridges Tony Bridges added a comment -

            That latter hang, by the way, is not present with the maven plugin 2.1 after a downgrade. That might be a useful data point.

            wilm Wilm Schomburg added a comment -

            We had the same issue with the maven plugin 2.3 and different Jenkins versions (1.554.2, 1.554.1 and older non-LTS versions). We had to downgrade to 2.1 to solve the issue and get our Jenkins stable again.

            jglick Jesse Glick added a comment -

            @tbridges @wilm if you can reproduce the problem easily in newer plugin versions but not older, we really need you to git bisect until you find the plugin commit introducing the problem, since I at least have no other leads.

            jglick Jesse Glick added a comment -

            Looks like the fix of JENKINS-22354, in 2.2, may have introduced this bug.

            kohsuke Kohsuke Kawaguchi added a comment -

            The thread dump from abayer shows that something weird is happening in SplittableBuildListener.

            Below is my analysis of the issue from one of our customers (ZD-19531), which turns out to be the same problem:

            Several threads appear to be blocked in SplittableBuildListener.synchronizeOnMark on the same object, which is odd, as its execution is supposed to be sequential.

            • Computer.threadPoolForRemoting [#1099] is waiting to enter SplittableBuildListener.synchronizeOnMark.
            • Computer.threadPoolForRemoting [#1108] is inside synchronizeOnMark and on markCountLock.wait.
            • Computer.threadPoolForRemoting [#1113] has found the mark and is trying to report it, but is blocked from entering.
            • Computer.threadPoolForRemoting [#1104] is inside synchronizeOnMark waiting for Future.get()

            I think there's incorrect use of synchronization here. When wait() happens, the lock is released, which allows another thread to enter synchronizeOnMark. We need to use another lock to ensure synchronizeOnMark is not concurrently invoked.

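The hazard Kohsuke describes can be sketched in a few lines. This is a hypothetical demonstration class, not the actual SplittableBuildListener code: Object.wait() releases the monitor it was called on, so while one caller is parked inside the method, a second caller can acquire the same monitor and enter "concurrently" -- exactly the overlap visible in the thread dumps.

```java
// Hypothetical sketch, NOT the real SplittableBuildListener: shows why a
// synchronized block alone cannot keep synchronizeOnMark() sequential.
public class WaitReentryDemo {
    private final Object markCountLock = new Object();
    private int inside = 0;             // callers currently inside the method
    private volatile int maxObserved = 0;

    void synchronizeOnMark() throws InterruptedException {
        synchronized (markCountLock) {
            inside++;
            maxObserved = Math.max(maxObserved, inside);
            markCountLock.wait(500);    // releases markCountLock while waiting
            inside--;
        }
    }

    /** Runs two overlapping callers; returns the peak number inside at once. */
    static int demo() {
        try {
            WaitReentryDemo d = new WaitReentryDemo();
            Thread first = new Thread(() -> {
                try { d.synchronizeOnMark(); } catch (InterruptedException ignored) {}
            });
            first.start();
            Thread.sleep(100);          // let the first caller reach wait()
            d.synchronizeOnMark();      // enters even though the first is still inside
            first.join();
            return d.maxObserved;       // 2: the monitor did not serialize the method
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("peak callers inside: " + demo());
    }
}
```

Both threads end up inside the synchronized section at once, so any "this code runs one caller at a time" assumption built on that monitor is unsound.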
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Kohsuke Kawaguchi
            Path:
            src/main/java/hudson/maven/SplittableBuildListener.java
            http://jenkins-ci.org/commit/maven-plugin/b145d5925ddeae2d697743920da204e6991375ac
            Log:
            [FIXED JENKINS-23098]

            Reference: ZD-19531

            Looking at [4], one notices that three threads are in an effective deadlock around synchronizeOnMark. I extracted the relevant part into [5].

            Thread #1661 is trying to report a discovered mark, but is blocked [1]. Thread #1665 is inside synchronizeOnMark, waiting in markCountLock.wait() [2]. Thread #1667 is stuck on Future.get() and hasn't returned [3]; it holds the lock that prevents [1] from unblocking [2].

            The root problem is that synchronizeOnMark method is never meant to be concurrently executed. But given the way the lock is used, if one thread gets to wait(), it's possible that another thread would come along and go into this function.

            In this change, I'm preventing that by introducing another lock to serialize the execution of the entire synchronizeOnMark() call. I'm not using the "this" object for locking because it's already used for another purpose (see the lock() method).

            I'm not yet clear on why the synchronizeOnMark() method is called concurrently to begin with. The interaction with the -T option of Maven is suspected.

            [1] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L2
            [2] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L34
            [3] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L71
            [4] https://gist.github.com/abayer/7ff4de807c6373eec40d
            [5] https://gist.github.com/kohsuke/374c22e737a77c9b0421

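The fix pattern the commit message describes can be sketched as follows (names are illustrative; the real change lives in hudson.maven.SplittableBuildListener). A dedicated outer lock serializes entire invocations of synchronizeOnMark(): the inner wait() still releases markCountLock, but the outer lock stays held, so no second caller can begin the method until the first one finishes.

```java
// Illustrative sketch of the fix pattern: an outer lock serializes whole
// invocations, because wait() only releases the monitor it is called on.
public class SerializedMarkSync {
    private final Object synchronizeOnMarkLock = new Object(); // serializes calls
    private final Object markCountLock = new Object();         // used with wait()
    private int inside = 0;
    private volatile int maxObserved = 0;

    void synchronizeOnMark() throws InterruptedException {
        synchronized (synchronizeOnMarkLock) {      // one invocation at a time
            synchronized (markCountLock) {
                inside++;
                maxObserved = Math.max(maxObserved, inside);
                markCountLock.wait(300);            // releases markCountLock only
                inside--;
            }
        }
    }

    /** Same overlap experiment as the buggy pattern; peak should now stay at 1. */
    static int demo() {
        try {
            SerializedMarkSync s = new SerializedMarkSync();
            Thread first = new Thread(() -> {
                try { s.synchronizeOnMark(); } catch (InterruptedException ignored) {}
            });
            first.start();
            Thread.sleep(100);       // first caller is parked in wait()
            s.synchronizeOnMark();   // blocks at the outer lock until first exits
            first.join();
            return s.maxObserved;    // 1: invocations no longer overlap
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("peak callers inside: " + demo());
    }
}
```

Using a second, dedicated lock object (rather than "this") also avoids interfering with the listener's existing use of its own monitor, which the commit message calls out.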
            kohsuke Kohsuke Kawaguchi added a comment -

            If you see this problem, can you please try out this build and report back if that fixes the problem?

            kohsuke Kohsuke Kawaguchi added a comment -

            Released Maven plugin 2.5 with this fix.


              People

              • Assignee:
                kohsuke Kohsuke Kawaguchi
                Reporter:
                abayer Andrew Bayer
              • Votes:
                1
                Watchers:
                8
