JENKINS-27922

Jenkins job execution becomes unstable - jobs fail with OOM: unable to create new native thread

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Component/s: ssh-agent-plugin
    • Environment:

      Description

      After running for 2-3 days, Jenkins jobs no longer launch.

      The console outputs usually just say that fetching from git failed, but sometimes contain other unusual errors.

      The Jenkins system log reports:

      java.lang.OutOfMemoryError: unable to create new native thread

      I was able to get a heap dump but due to the potential inclusion of sensitive data cannot post it.

      In VisualVM analysis of the heap dump, I noticed that there are almost 1000 instances of AgentServer and AgentServer$1. The threads don't show up in the thread monitor, but are still referenced somehow.

      Unfortunately the parent references are numerous and hard to decipher. The proximate parent is the ThreadGroup.threads array in the main ThreadGroup instance. This seems unlikely to be the true root cause.

      I also noticed about the same number of ThreadLocalMap instances, so the leak may be related to incorrect use of ThreadLocal.
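
One way to check whether native threads are actually accumulating, independent of the heap dump, is to watch the JVM's thread count at the OS level. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name native_thread_count and the JENKINS_PID variable are illustrative and not part of the original report.

```shell
# Print the number of native threads (tasks) of a process on Linux.
# Assumption: /proc is mounted; the PID is passed as the first argument.
native_thread_count() {
  pid=$1
  # Fail if the process does not exist (or /proc is unavailable).
  [ -d "/proc/$pid/task" ] || return 1
  # Each entry under /proc/<pid>/task corresponds to one native thread.
  ls "/proc/$pid/task" | wc -l | tr -d ' '
}

# Hypothetical usage: log the count once a minute for the Jenkins JVM.
# JENKINS_PID is a placeholder that must be set to the real PID.
# while :; do
#   printf '%s %s\n' "$(date +%FT%T)" "$(native_thread_count "$JENKINS_PID")"
#   sleep 60
# done
```

If the count climbs steadily between restarts, that would be consistent with a thread leak; on Linux the per-user process limit (ulimit -u) is one ceiling such growth can run into.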

      I have attached a screenshot of the AgentServer$1 instances in VisualVM, and the Jenkins system log.

      Please let me know if there is any other analysis I can provide.

      I am entering this bug as a blocker because I don't currently have a workaround. I am using Jenkins in conjunction with an external PHP application that needs to post jobs to the Jenkins build queue. To work around the problem, I would therefore need to implement a controlled shutdown process and restart Jenkins at a daily or semi-daily interval. This would ultimately require the calling application to retry, which is probably a good idea anyway but is not yet implemented.
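
The retry behaviour mentioned above could look roughly like the following. This is only a sketch in shell (the real caller in this report is a PHP application), and the retry function, its arguments, and the JENKINS_URL, JOB, API_USER and API_TOKEN variables in the usage comment are all hypothetical.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying.
    "$@" && return 0
    # Give up once the attempt budget is used.
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Hypothetical usage: queue a build, retrying while Jenkins restarts.
# All variables below are placeholders, not values from the report.
# retry 5 10 curl -fsS -X POST --user "$API_USER:$API_TOKEN" \
#   "$JENKINS_URL/job/$JOB/build"
```

With a wrapper like this, a request that arrives during a scheduled restart would be attempted again after a pause instead of being lost, provided the restart completes within the total retry window.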

        Attachments

        1. jenkins.log
          146 kB
        2. plugins.xml
          3 kB
        3. Screen Shot 2015-04-13 at 11.02.27 AM.png
          314 kB
        4. Thread dump [Jenkins].html
          485 kB
        5. Thread dump [Jenkins].html
          348 kB

          Issue Links

            Activity

            jamie Jamie Doornbos added a comment -

            I do generally see the Stopped line on builds, but I don't watch every build. I checked using a grep on files no more than 3 days old and found a small discrepancy: 20 more logs contain a "Started" line than a "Stopped" line:

            [/var/lib/jenkins/jobs]$ find . -mtime -3 -type f > /tmp/recent_logs
            [/var/lib/jenkins/jobs]$ grep -lF '[ssh-agent] Started.' `cat /tmp/recent_logs` > /tmp/agent-started-logs
            [/var/lib/jenkins/jobs]$ grep -lF '[ssh-agent] Stopped.' `cat /tmp/recent_logs` > /tmp/agent-stopped-logs
            [/var/lib/jenkins/jobs]$ wc -l /tmp/agent-st*
            2578 /tmp/agent-started-logs
            2558 /tmp/agent-stopped-logs
            5136 total
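
Comparing the two counts shows the size of the discrepancy but not which builds are affected. A minimal sketch that lists each recent log containing a "Started" marker but no "Stopped" marker follows; it matches the markers as fixed strings (grep -F), and the function name list_unstopped is an illustrative assumption.

```shell
# list_unstopped DIR [DAYS]
# Prints every file under DIR, modified within the last DAYS days
# (default 3), that contains the ssh-agent "Started." marker but
# not the "Stopped." marker. Assumes file names contain no newlines.
list_unstopped() {
  dir=$1
  days=${2:-3}
  # -F treats the pattern as a literal string, so the square
  # brackets are not interpreted as a regex bracket expression.
  find "$dir" -type f -mtime "-$days" \
      -exec grep -lF '[ssh-agent] Started.' {} + |
    while IFS= read -r f; do
      grep -qF '[ssh-agent] Stopped.' "$f" || printf '%s\n' "$f"
    done
}

# Hypothetical usage:
# list_unstopped /var/lib/jenkins/jobs 3
```

Builds still running at the time of the check would also appear in the list, so some entries may be expected.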

            Regarding use of SSH Agent, it is configured for all builds, since the git plugin fails to work in my environment if SSH Agent is not running. (I spent a few hours trying to debug this months ago, but don't really remember the details.) Most of the builds don't require an agent other than for the git plugin.

            jamie Jamie Doornbos added a comment - - edited

            This is becoming more serious for us as we approach full rollout. Currently, Jenkins needs to be restarted every 6 hours, and this is sometimes not enough. It also means builds that take longer than 6 hours currently have to be run out of band (from a separate command shell). This means we may be forced to replace Jenkins at the last minute, which would make me sad.

            BUT... my employer wants to sponsor this issue. How much money would be a good enough incentive? I suggested $500. Do you have any reproducible case or some idea of how to fix it? Would it be okay to state the terms as something like "my Jenkins instance does not fail due to SSH Agent after 2 days"?

            danielbeck Daniel Beck added a comment -

            I don't have the time right now to work on any bounties (plus I got burned in the past). I'm doing some issue triaging for the Jenkins project, which is my only interest in this specific issue. I am not a developer of the SSH Agent Plugin, nor do I use it myself.

            Maybe try the jenkinsci-users mailing list about your Git Plugin issue.

            danielbeck Daniel Beck added a comment -

            Seems to be the plugin and not an issue in core.

            danielbeck Daniel Beck added a comment -

            Looks like a duplicate of JENKINS-27555 that was fixed in SSH Agent 1.7.


              People

              • Assignee:
                Unassigned
                Reporter:
                jamie Jamie Doornbos
              • Votes:
                0
                Watchers:
                1
