Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-13330

Jenkins slave hangs in post build phase

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Labels:
      None
    • Environment:
      RHEL 5, both master and all slaves.
      Jenkins is running inside of Tomcat

      Description

      We have an intermittent problem with slaves hanging AFTER the job itself is finished. In the post processing step what we see is that the console log has this line:

      Description set: vap_current_iter-2012_03_29_19_01_03

      And then nothing. Usually, it will look like this:

      Description set: prod_pull-2012_03_28_19_01_03
      Notifying upstream build armada_Launch_prod_pull #13 of job completion
      Project armada_Launch_prod_pull still waiting for 1 builds to complete
      Notifying upstream projects of job completion
      Notifying upstream of completion: armada_Launch_prod_pull #13
      Finished: SUCCESS

      I setup a logger for hudson.model.Run, and it currently has this :

          at java.lang.Thread.run(Thread.java:619)
      
      Mar 30, 2012 12:44:00 PM hudson.model.Run run
      INFO: galleon_allUnit #1134 main build action completed: SUCCESS
      Mar 30, 2012 12:44:00 PM hudson.model.Run setResult
      FINE: galleon_allUnit #1134 : result is set to SUCCESS
      java.lang.Exception
          at hudson.model.Run.setResult(Run.java:352)
          at hudson.model.Run.run(Run.java:1410)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:238)
      

      Repeated for every hung slave.

      The main hudson log doesn't have any additional information.

      Disconnecting the slave has no effect.

      Trying to do an orderly shutdown of Jenkins has no effect (jenkins actually appears to hang on shutdown).

      The only way we have found to recover is to kill -9 the tomcat process.

      The tread dump for one of the slaves (they are all the same) is:

      Thread Dump
      Channel reader thread: channel
      
      "Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native)
          at java.io.FileInputStream.readBytes(Native Method)
          at java.io.FileInputStream.read(FileInputStream.java:199)
          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          -  locked java.io.BufferedInputStream@1ae615a
          at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
          at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:1030)
      
      
      main
      
      "main" Id=1 Group=main WAITING on hudson.remoting.Channel@e1d5ea
          at java.lang.Object.wait(Native Method)
          -  waiting on hudson.remoting.Channel@e1d5ea
          at java.lang.Object.wait(Object.java:485)
          at hudson.remoting.Channel.join(Channel.java:766)
          at hudson.remoting.Launcher.main(Launcher.java:420)
          at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:366)
          at hudson.remoting.Launcher.run(Launcher.java:206)
          at hudson.remoting.Launcher.main(Launcher.java:168)
      
      
      Ping thread for channel hudson.remoting.Channel@e1d5ea:channel
      
      "Ping thread for channel hudson.remoting.Channel@e1d5ea:channel" Id=10 Group=main TIMED_WAITING
          at java.lang.Thread.sleep(Native Method)
          at hudson.remoting.PingThread.run(PingThread.java:86)
      
      
      Pipe writer thread: channel
      
      "Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@14263ed
          at sun.misc.Unsafe.park(Native Method)
          -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@14263ed
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
          at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
          at java.lang.Thread.run(Thread.java:619)
      
      
      pool-1-thread-267
      
      "pool-1-thread-267" Id=285 Group=main RUNNABLE
          at sun.management.ThreadImpl.dumpThreads0(Native Method)
          at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374)
          at hudson.Functions.getThreadInfos(Functions.java:872)
          at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:93)
          at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:89)
          at hudson.remoting.UserRequest.perform(UserRequest.java:118)
          at hudson.remoting.UserRequest.perform(UserRequest.java:48)
          at hudson.remoting.Request$2.run(Request.java:287)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
      
          Number of locked synchronizers = 1
          - java.util.concurrent.locks.ReentrantLock$NonfairSync@1186f88
      
      
      Finalizer
      
      "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@1798fdd
          at java.lang.Object.wait(Native Method)
          -  waiting on java.lang.ref.ReferenceQueue$Lock@1798fdd
          at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
          at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
          at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
      
      
      Reference Handler
      
      "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@1d40442
          at java.lang.Object.wait(Native Method)
          -  waiting on java.lang.ref.Reference$Lock@1d40442
          at java.lang.Object.wait(Object.java:485)
          at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
      
      
      Signal Dispatcher
      
      "Signal Dispatcher" Id=4 Group=system RUNNABLE
      

      Any ideas on how to better recover or prevent this would be greatly appreciated.

        Issue Links

          Activity

          Hide
          randallrschulz Randall Schulz added a comment -

          A thread dump of the Jenkins master process when there were 8 stalled (slave) jobs.

          Show
          randallrschulz Randall Schulz added a comment - A thread dump of the Jenkins master process when there were 8 stalled (slave) jobs.
          Hide
          randallrschulz Randall Schulz added a comment -

          I am experiencing what I believe to be the same problem. Because of the timing of the most recent occurrence of the event on my system, I had 8 jobs in this stalled state. Most interestingly, I discovered that stopping just one of these jobs (as in clicking the red X icon in the Build Executor Status section) caused not only that one job to terminate, 4 others went with it. When I stopped one of the remaining 3 jobs, they all disappeared.

          Perhaps this is a clue.

          I have attached a thread dump file.

          Show
          randallrschulz Randall Schulz added a comment - I am experiencing what I believe to be the same problem. Because of the timing of the most recent occurrence of the event on my system, I had 8 jobs in this stalled state. Most interestingly, I discovered that stopping just one of these jobs (as in clicking the red X icon in the Build Executor Status section) caused not only that one job to terminate, 4 others went with it. When I stopped one of the remaining 3 jobs, they all disappeared. Perhaps this is a clue. I have attached a thread dump file.
          Hide
          randallrschulz Randall Schulz added a comment -

          A thread dump of the Jenkins master process when there were 8 stalled (slave) jobs.

          [Apparently I did not succeed last time.]

          Show
          randallrschulz Randall Schulz added a comment - A thread dump of the Jenkins master process when there were 8 stalled (slave) jobs. [Apparently I did not succeed last time.]
          Hide
          kohsuke Kohsuke Kawaguchi added a comment -

          In the thread dump, #2284 is waiting for the build to complete, then later concurrent builds are waiting for that to finish before it sends out its failure notification.

          Show
          kohsuke Kohsuke Kawaguchi added a comment - In the thread dump, #2284 is waiting for the build to complete, then later concurrent builds are waiting for that to finish before it sends out its failure notification.
          Hide
          kohsuke Kohsuke Kawaguchi added a comment -

          #2284 is waiting for the shell script build step to complete. Later builds are past that point, but those are blocked at the point of e-mail notification, waiting for the previous build to complete (or else it won't be able to decide what to send out in the e-mail.)

          So at this point, my reading of this is that this is not a bug in Jenkins but simply a hang in the user script, but if you suspect otherwise, please wait for the next hang to occur, then obtain the thread dump both on the master and the slave, and check what the first blocked build is doing.

          As I've explained above, concurrent executions of the same job can block if the earlier build is blocked, so we need to focus on the root cause, which is why the first of the blocked builds is blocking.

          Show
          kohsuke Kohsuke Kawaguchi added a comment - #2284 is waiting for the shell script build step to complete. Later builds are past that point, but those are blocked at the point of e-mail notification, waiting for the previous build to complete (or else it won't be able to decide what to send out in the e-mail.) So at this point, my reading of this is that this is not a bug in Jenkins but simply a hang in the user script, but if you suspect otherwise, please wait for the next hang to occur, then obtain the thread dump both on the master and the slave, and check what the first blocked build is doing. As I've explained above, concurrent executions of the same job can block if the earlier build is blocked, so we need to focus on the root cause, which is why the first of the blocked builds is blocking.
          Hide
          smorriso Stephen Morrison added a comment -

          I have seen the same thing. One job of type A hanging (because of a problem in the script) causes all other jobs of type A to hang on completion. This only happens for me when all the Job As were triggered from the same Job B. i.e. Job B triggers 8 Job As. One Job A hangs. All the other Job As will also hang even though they have completed successfully.

          I think that's a bug, jobs should be able to run in parallel without having dependencies on the success (or otherwise) of other jobs of the same type.

          Show
          smorriso Stephen Morrison added a comment - I have seen the same thing. One job of type A hanging (because of a problem in the script) causes all other jobs of type A to hang on completion. This only happens for me when all the Job As were triggered from the same Job B. i.e. Job B triggers 8 Job As. One Job A hangs. All the other Job As will also hang even though they have completed successfully. I think that's a bug, jobs should be able to run in parallel without having dependencies on the success (or otherwise) of other jobs of the same type.
          Hide
          sjagoe Simon Jagoe added a comment - - edited

          I am working on a system that uses COM to communicate with a backend service. There are some test cases that hang in certain rare cases when using COM, which cause other (concurrent) builds of the same project to not complete, hanging in the build post-processing.

          While we are working on trying to prevent hanging in our test cases, I do not think Jenkins should be waiting before sending emails. It makes Jenkins not very robust against problematic builds. In my opinion the build system should expect problems in the user scripts and be able to handle that.

          Show
          sjagoe Simon Jagoe added a comment - - edited I am working on a system that uses COM to communicate with a backend service. There are some test cases that hang in certain rare cases when using COM, which cause other (concurrent) builds of the same project to not complete, hanging in the build post-processing. While we are working on trying to prevent hanging in our test cases, I do not think Jenkins should be waiting before sending emails. It makes Jenkins not very robust against problematic builds. In my opinion the build system should expect problems in the user scripts and be able to handle that.
          Hide
          hanabishi Marcus Jacobsson added a comment -

          This could be related to the problem we have with hanging slaves.
          The problem we had was that Jenkins noticed that the program it is running has ended and thus thinks that the program has a bug where it won't send the eof tag. Due to this it wants to write back a message about a wiki post.
          ""String msg = "Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information";
          Throwable e = new Exception().fillInStackTrace();
          LOGGER.log(Level.WARNING,msg,e);""
          But at this point the logger connection is deadlocked and the job will hang. To prevent the hanging we've patched our version of Jenkins with https://github.com/jenkinsci/jenkins/pull/665.

          Show
          hanabishi Marcus Jacobsson added a comment - This could be related to the problem we have with hanging slaves. The problem we had was that Jenkins noticed that the program it is running has ended and thus thinks that the program has a bug where it won't send the eof tag. Due to this it wants to write back a message about a wiki post. ""String msg = "Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information"; Throwable e = new Exception().fillInStackTrace(); LOGGER.log(Level.WARNING,msg,e);"" But at this point the logger connection is deadlocked and the job will hang. To prevent the hanging we've patched our version of Jenkins with https://github.com/jenkinsci/jenkins/pull/665 .

            People

            • Assignee:
              Unassigned
              Reporter:
              cwright Clark Wright
            • Votes:
              6 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated: