Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None

      Description

      This is to track the problem originally reported here: http://n4.nabble.com/Polling-hung-td1310838.html#a1310838
      The referenced thread has been relocated to http://jenkins.361315.n4.nabble.com/Polling-hung-td1310838.html

      What the problem boils down to is that many remote operations are performed synchronously, which keeps the channel object locked until the response returns. When a lengthy remote operation is using the channel, SCM polling can be blocked waiting for the channel's monitor to be released. In extreme situations, all of the polling threads can wind up waiting on the channel objects' monitors, preventing further processing of polling tasks.

      Furthermore, if the slave dies, the locked channel object still exists in the master JVM. If no IOException is thrown to signal that the connection to the pipe has terminated, the channel can never be closed, because Channel.close() is itself a synchronized operation.
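
      To make the failure mode concrete, here is a minimal Java sketch (not Jenkins code; the Channel class below is only a hypothetical stand-in for hudson.remoting.Channel) showing how one long synchronous call on a synchronized object blocks every other caller, including any attempt to close it:

      // Hypothetical sketch of the contention described above, not actual Jenkins code.
      class Channel {
          // A long synchronous remote call holds the channel's monitor for its entire duration...
          synchronized void call(long millis) throws InterruptedException {
              Thread.sleep(millis); // simulates waiting for the slave's response
          }
          // ...so close() (also synchronized) cannot run until the call returns.
          synchronized void close() { /* release resources */ }
      }

      public class PollingBlockedDemo {
          public static void main(String[] args) throws Exception {
              Channel channel = new Channel();

              Thread longOperation = new Thread(() -> {
                  try { channel.call(5_000); } catch (InterruptedException ignored) { }
              }, "long-remote-operation");

              Thread scmPoller = new Thread(() -> {
                  try { channel.call(100); } catch (InterruptedException ignored) { }
              }, "SCM-polling");

              longOperation.start();
              Thread.sleep(200);   // let the long call acquire the monitor first
              scmPoller.start();   // this thread now waits on the channel's monitor

              Thread.sleep(500);
              System.out.println(scmPoller.getName() + " state: " + scmPoller.getState()); // typically BLOCKED
          }
      }

      Until the long call returns, every other synchronized use of the same channel, including polling and Channel.close(), is parked in exactly this way.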

        Attachments

        1. DUMP1.txt
          57 kB
        2. hung_scm_pollers_02.PNG
          145 kB
        3. thread_dump_02.txt
          92 kB
        4. threads.vetted.txt
          163 kB

          Issue Links

            Activity

            meolsen Morten Engelhardt Olsen added a comment - edited

            At Atmel we're now managing this issue by running the following system Groovy script every couple of minutes to monitor the processor load:

            import java.lang.management.*;
            
            // JMX beans for per-thread CPU time and overall process/system CPU load
            def threadBean = ManagementFactory.getThreadMXBean();
            def osBean     = ManagementFactory.getOperatingSystemMXBean();
            
            println "\n\n\n[Checking state of (master)]";
            
            println "Current CPU Time used by Jenkins: " + threadBean.getCurrentThreadCpuTime() + "ns";
            
            double processLoad = (osBean.getProcessCpuLoad() * 100).round(2);
            double cpuLoad = (osBean.getSystemCpuLoad() * 100).round(2);
            println "Process CPU Load: " + processLoad + "%";
            println "CPU Load: " + cpuLoad + "%";
            
            // Only hunt for stuck threads when the Jenkins process itself is busy
            if (processLoad < 90) {
              println "\n\n\n === Load is less than 90%, nothing to do ===\n\n\n";
              println "\n\n\n[Done checking: CPU Load: " + cpuLoad + "%]\n\n\n";
              return;
            } else {
              println "\n\n\n === Load is more than 90%, checking for stuck threads! ===\n\n\n";
            }
            
            
            println "\n\n\n[Checking all threads]\n\n\n";
            def threadNum = 0;
            def killThreadNum = 0;
            
            // Interrupt any polling threads spawned by TimerTrigger
            def stacktraces = Thread.getAllStackTraces();
            stacktraces.each { thread, stack ->
              if (thread.getName().contains("trigger/TimerTrigger/check")) {
                println "=== Interrupting thread " + thread.getName() + " ===";
                thread.interrupt();
                killThreadNum++;
              }
              threadNum++;
            }
            
            println "\n\n\n[Done checking: " + threadNum + " threads, killed " + killThreadNum + "]\n\n\n";
            
            return; // Suppress groovy state dump

            Note that we had to check for TimerTrigger, not SCM Polling as the original code did. This is currently running on 1.580.2.

            ndirons Nathaniel Irons added a comment -

            The script provided on Jan 13 seems to be solving a different problem. On our instance, we see stuck SCM polling threads even when the CPU load is zero. With three SCM polling processes stuck as of this moment, the thread names reported by Thread.getAllStackTraces() are main, Finalizer, Signal Dispatcher, and Reference Handler.

            I'm pig-ignorant of Groovy, and have yet to figure out where its access to Jenkins' thread innards is documented, but previous iterations of scripts that did identify a stuck thread to interrupt were ineffective for us; we have yet to find an effective workaround that doesn't rely on restarting the Jenkins daemon.

            We're using 1.590, and looking to switch to LTS releases as soon as they pass us by.

            ahoffmann Andrew Hoffmann added a comment -

            We are experiencing git polling getting hung as well. We have ~15 jobs that poll every 5 minutes. It gets hung roughly 24 hours after a service restart. We also have the BitBucket pull request builder polling every 5 minutes for another ~15 jobs.

            Jenkins v1.622
            git plugin 2.4.0
            git-client plugin 1.18.0
            bitbucket-pullrequest-builder plugin 1.4.7

            Somewhere between 0 and 30 minutes before the hang, I see this exception:

            WARNING: Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information
            java.lang.Exception
            	at hudson.Proc$LocalProc.join(Proc.java:329)
            	at hudson.Proc.joinWithTimeout(Proc.java:168)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1596)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1576)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1572)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1233)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$4.execute(CliGitAPIImpl.java:583)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1310)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1261)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1252)
            	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.getHeadRev(CliGitAPIImpl.java:2336)
            	at hudson.plugins.git.GitSCM.compareRemoteRevisionWithImpl(GitSCM.java:583)
            	at hudson.plugins.git.GitSCM.compareRemoteRevisionWith(GitSCM.java:527)
            	at hudson.scm.SCM.compareRemoteRevisionWith(SCM.java:381)
            	at hudson.scm.SCM.poll(SCM.java:398)
            	at hudson.model.AbstractProject._poll(AbstractProject.java:1461)
            	at hudson.model.AbstractProject.poll(AbstractProject.java:1364)
            	at jenkins.triggers.SCMTriggerItem$SCMTriggerItems$Bridge.poll(SCMTriggerItem.java:119)
            	at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:510)
            	at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:539)
            	at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:118)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
            	at java.util.concurrent.FutureTask.run(Unknown Source)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            	at java.lang.Thread.run(Unknown Source)
            

            I will be happy to provide more configuration details and logs if requested.

            ymartin1040 Yves Martin added a comment - edited

            When investigating a Subversion SCM polling issue (JENKINS-31192), I found that a global lock on hudson.scm.SubversionSCM$ModuleLocation prevents threads from working concurrently. Is that "big lock" really necessary? Maybe it is possible to shrink the section of code during which the lock is held.
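
            A hypothetical sketch of what narrowing that lock might look like (this is not the actual SubversionSCM code, just an illustration of the general technique of holding the shared monitor only around the state that needs protection, not around the remote call):

            // Illustration only; method names here are made up, not Jenkins/SubversionSCM APIs.
            public class LockScopeSketch {

                private final Object lock = new Object();

                // Coarse version: the lock is held for the whole poll, including the
                // slow remote call, so concurrent pollers serialize on network I/O.
                void pollCoarse() {
                    synchronized (lock) {
                        String remoteRevision = fetchRemoteRevision(); // slow I/O under the lock
                        compareAndRecord(remoteRevision);
                    }
                }

                // Narrow version: the slow remote call runs outside the lock; the lock
                // only guards the shared bookkeeping that actually needs protection.
                void pollNarrow() {
                    String remoteRevision = fetchRemoteRevision();     // slow I/O, no lock held
                    synchronized (lock) {
                        compareAndRecord(remoteRevision);
                    }
                }

                // Placeholders standing in for the real SCM operations.
                String fetchRemoteRevision() { return "r12345"; }
                void compareAndRecord(String revision) { /* update shared polling state */ }

                public static void main(String[] args) {
                    LockScopeSketch sketch = new LockScopeSketch();
                    sketch.pollCoarse();
                    sketch.pollNarrow();
                }
            }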

            zxkane Meng Xin Zhu added a comment -

            Still happening on Jenkins LTS 2.19.4.

            My job polls the git repo periodically (every 5 minutes). However, SCM polling can hang indefinitely with no timeout, and subsequent manual builds of the job are also blocked behind the stuck polling. This is definitely a critical issue that impacts the usability of Jenkins.


              People

              • Assignee:
                Unassigned
                Reporter:
                dty Dean Yu
              • Votes:
                136
                Watchers:
                141

                Dates

                • Created:
                  Updated: