JENKINS-45057

"too many files open": file handles leak, job output file not closed

      Description

      Jenkins seems to keep an open file handle to the log file (job output) for every single build, even for builds that have been discarded by the "Discard old builds" policy.

       

      This is a sample of the lsof output (whole file attached)

      java 8870 jenkins 941w REG 252,0 1840 1332171 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50063/log (deleted)
      java 8870 jenkins 942w REG 252,0 2023 402006 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50044/log (deleted)
      java 8870 jenkins 943w REG 252,0 2193 1332217 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50101/log
      java 8870 jenkins 944w REG 252,0 2512 1332247 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50106/log
      java 8870 jenkins 945w REG 252,0 1840 1703994 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50067/log (deleted)
      java 8870 jenkins 946w REG 252,0 2350 1332230 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50092/log (deleted)
      java 8870 jenkins 947w REG 252,0 1840 402034 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50049/log (deleted)
      java 8870 jenkins 948w REG 252,0 1840 927855 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50080/log (deleted)
      java 8870 jenkins 949w REG 252,0 2195 1332245 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50095/log (deleted)
      java 8870 jenkins 950w REG 252,0 2326 1332249 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50107/log
      java 8870 jenkins 952w REG 252,0 2195 1332227 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50102/log
      java 8870 jenkins 953w REG 252,0 2154 1332254 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50109/log
      java 8870 jenkins 954w REG 252,0 2356 1332282 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50105/log
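
A listing like the one above can be summarized per job to see how many open handles point at build logs that were already deleted. The following is a minimal sketch, not part of the original report. It assumes the nine-column lsof layout seen in the sample (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME) with no empty columns, and the `.../builds/<number>/log` path layout; the name `count_deleted_build_logs` is hypothetical.

```python
import re
from collections import Counter

# NAME column of a build log entry, for example:
#   /data/jenkins/jobs/.../builds/.50063/log (deleted)
# Discarded builds in the sample carry a leading dot and a " (deleted)" suffix.
BUILD_LOG = re.compile(
    r"^(?P<job_dir>/.+)/builds/\.?(?P<build>\d+)/log(?P<deleted> \(deleted\))?$"
)


def count_deleted_build_logs(lsof_lines):
    """Return (open_logs, deleted_logs) Counters keyed by job directory."""
    open_logs, deleted_logs = Counter(), Counter()
    for line in lsof_lines:
        # Assumption: the first eight columns contain no spaces, NAME is the rest.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = BUILD_LOG.match(fields[8].strip())
        if match is None:
            continue
        open_logs[match["job_dir"]] += 1
        if match["deleted"]:
            deleted_logs[match["job_dir"]] += 1
    return open_logs, deleted_logs
```

Feeding it the lines of the attached lsof output, for example `count_deleted_build_logs(open("lsof.txt"))`, gives per-job totals that can be compared between runs.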
      

       

            Activity

            danielbeck Daniel Beck added a comment -

            Right, the Stapler one is tracked in JENKINS-45903.

            stevenatcisco Steven Christenson added a comment -

            Oleg Nenashev: We tried using the File Leak Detector Plugin, but it would not run. Apparently it requires Oracle Java, and we are using OpenJDK. The kohsuke leak detector crashed our Jenkins instance when we ran it. It too seems to require Oracle Java.

            Here is the job we are running hourly, and the results:

            /* JOB TO PERIODICALLY CHECK FILE HANDLES */
            node('master') {
                sh '''rm -f lsof.txt
                lsof -u jenkins > lsof.txt
                cut -f 1 /proc/sys/fs/file-nr > filehandles.txt
                echo "$(cat filehandles.txt)=handles |" > numfiles.txt
                echo "$(wc -l < lsof.txt)=JenkLSOF |" >> numfiles.txt
                echo "$(grep -Fc \'(deleted)\' lsof.txt)=deleted " >> numfiles.txt
                cat numfiles.txt
                '''
                archiveArtifacts allowEmptyArchive: true, artifacts: '*.txt', caseSensitive: false
                result = readFile 'numfiles.txt'
                currentBuild.description = result
                fileHandlesInUse = readFile 'filehandles.txt'
                deleteDir()
            } // node

            /******* RESULTS *******/
            Aug 30, 2017 6:56 AM 9472=handles | 10554=JenkLSOF | 3621=deleted
            Aug 30, 2017 5:56 AM 9568=handles | 10654=JenkLSOF | 3557=deleted
            Aug 30, 2017 4:56 AM 9376=handles | 10521=JenkLSOF | 3524=deleted
            Aug 30, 2017 3:56 AM 9312=handles | 10417=JenkLSOF | 3462=deleted
            Aug 30, 2017 2:56 AM 9216=handles | 10358=JenkLSOF | 3401=deleted
            Aug 30, 2017 1:56 AM 9184=handles | 10276=JenkLSOF | 3338=deleted
            Aug 30, 2017 12:56 AM 9312=handles | 10406=JenkLSOF | 3303=deleted
            Aug 29, 2017 11:56 PM 9216=handles | 10338=JenkLSOF | 3236=deleted
            Aug 29, 2017 10:56 PM 9408=handles | 10423=JenkLSOF | 3198=deleted
            Aug 29, 2017 9:56 PM 8896=handles | 10042=JenkLSOF | 3137=deleted
            Aug 29, 2017 8:56 PM 9024=handles | 10138=JenkLSOF | 3098=deleted
            Aug 29, 2017 7:56 PM 9024=handles | 10243=JenkLSOF | 3028=deleted
            Aug 29, 2017 6:56 PM 8896=handles | 9948=JenkLSOF | 2981=deleted
            Aug 29, 2017 5:56 PM 8768=handles | 9879=JenkLSOF | 2913=deleted
            Aug 29, 2017 4:56 PM 8832=handles | 9879=JenkLSOF | 2844=deleted
            Aug 29, 2017 3:56 PM 8608=handles | 9731=JenkLSOF | 2773=deleted
            Aug 29, 2017 2:56 PM 8448=handles | 9587=JenkLSOF | 2741=deleted
            Aug 29, 2017 1:56 PM 8384=handles | 9556=JenkLSOF | 2681=deleted
            Aug 29, 2017 12:56 PM 8192=handles | 9452=JenkLSOF | 2650=deleted
            Aug 29, 2017 11:56 AM 8096=handles | 9306=JenkLSOF | 2590=deleted
            Aug 29, 2017 1:56 AM 8064=handles | 8921=JenkLSOF | 2081=deleted

            The "deleted" items are all log entries like those described in the original incident. 
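
The hourly samples can be turned into an approximate leak rate. The sketch below is an illustration only: it assumes the timestamp format used in the build descriptions above and takes the simple slope between the earliest and latest sample; the name `leak_rate_per_hour` is hypothetical.

```python
from datetime import datetime


def leak_rate_per_hour(samples):
    """samples: iterable of (timestamp_text, deleted_count) pairs.

    Returns the average growth in deleted-but-open handles per hour
    between the earliest and the latest sample.
    """
    parsed = sorted(
        (datetime.strptime(text, "%b %d, %Y %I:%M %p"), count)
        for text, count in samples
    )
    (first_time, first_count), (last_time, last_count) = parsed[0], parsed[-1]
    hours = (last_time - first_time).total_seconds() / 3600
    if hours == 0:
        raise ValueError("need samples taken at two different times")
    return (last_count - first_count) / hours


# Three of the samples reported above.
samples = [
    ("Aug 29, 2017 11:56 AM", 2590),
    ("Aug 29, 2017 11:56 PM", 3236),
    ("Aug 30, 2017 6:56 AM", 3621),
]
print(round(leak_rate_per_hour(samples), 1))  # prints 54.3
```

Between Aug 29 11:56 AM and Aug 30 6:56 AM (19 hours) the deleted count grows from 2590 to 3621, that is 1031 handles, or roughly 54 per hour.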

            NOTE: I have opened an incident under our support contract, but have posted details here in case they help to diagnose the root cause. Is there another tool we can use? Or would the lsof output over many hours be sufficient?
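
On Linux, one alternative that does not depend on the JVM vendor is to read the per-process descriptor directory `/proc/<pid>/fd`, where each entry is a symbolic link to the open file and unlinked files carry a ` (deleted)` suffix. The following is a minimal sketch under that assumption; it only reports which descriptors are affected, not which code opened them, and the name `deleted_open_files` is hypothetical.

```python
import os


def deleted_open_files(fd_dir):
    """Return sorted (fd, target) pairs whose link target ends in ' (deleted)'.

    fd_dir is expected to be a directory such as /proc/<pid>/fd.
    """
    found = []
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return sorted(found)
```

For the Jenkins process in the lsof sample from the description this would be called as `deleted_open_files("/proc/8870/fd")`, run as the same user or as root.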

            stevenatcisco Steven Christenson added a comment -

            Here is confirmation that the upgrade resolved the leak... mostly.

            In the last 48 hours we have seen 6 file handle leaks. Previously that would have been hundreds.

            oleg_nenashev Oleg Nenashev added a comment -

            Even 6 leaks is quite suspicious, but I'd guess we cannot do anything about it without File Leak Detector.

            wheleph Volodymyr Sobotovych added a comment -

            Oleg Nenashev: After upgrading to Jenkins 2.73.3 the issue became less severe, but we still have to restart our Jenkins instance once a week (with 2.60 it was once a day).

            Here's the summary of 2 lsof runs with 1 day between them. The list of top files:

            Nov-17:

            100632 slave.log
            32294 log
            7685 timestamps
            4193 random
            3635 urandom

            Nov-18:

            708532 log
            297707 timestamps
            98280 slave.log
            90675 Common.groovy
            85995 BobHelper.groovy
            

            Does this give you more information to find the cause? Unfortunately, it is a bit hard for me to provide the File Leak Detector plugin output because we use OpenJDK.
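
A summary of this kind can be produced by counting lsof rows per file name and then comparing two snapshots. The sketch below is one possible way to do it, not necessarily how the numbers above were obtained. It assumes the nine-column lsof layout from the issue description; `top_open_files` and `growth` are hypothetical names.

```python
import os
from collections import Counter

DELETED_SUFFIX = " (deleted)"


def top_open_files(lsof_lines, n=5):
    """Count lsof rows per file base name and return the n most common."""
    counts = Counter()
    for line in lsof_lines:
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue  # skip short rows and the header row
        name = fields[8].strip()
        if name.endswith(DELETED_SUFFIX):
            name = name[: -len(DELETED_SUFFIX)]
        counts[os.path.basename(name)] += 1
    return counts.most_common(n)


def growth(before, after):
    """Per-name change between two snapshots, largest increase first."""
    delta = Counter(dict(after))
    delta.subtract(dict(before))
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)
```

Applied to the three names that appear in both lists above, `growth` gives +676238 for `log`, +290022 for `timestamps`, and -2352 for `slave.log`. Names missing from one top-five list, such as `Common.groovy`, cannot be compared this way because their earlier count is unknown.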


              People

              • Assignee: Jesse Glick (jglick)
              • Reporter: Bruno Bonacci (bbonacci)
              • Votes: 13
              • Watchers: 29
