Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-52150

High CPU sys time consumption after upgrade to 2.107.3

    XMLWordPrintable

    Details

    • Similar Issues:

      Description

      We have several heavily loaded instances of Jenkins. The most biggest of them builds around 30 000 jobs per day (two others - around 6000).

      After upgrade to Jenkins 2.107.3 on the most heavily loaded instance we see a hight CPU sys time consumption. For users this looks like that the Jenkins works fast, but jobs build slow and the last step "Waiting for Jenkins to finish collecting data" takes a very long time (very similar to https://issues.jenkins-ci.org/browse/JENKINS-19712).

      Moreover, some jobs simply looped at random steps:

      [FS_PR_INIT] Running shell script
      Cannot contact jenkins-agent-linux-007: java.lang.InterruptedException
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      + curl --user '****:****' -s -k -m 60 https://sbt-jenkins.ca.sbrf.ru/jenkins/queue/api/json
      

      After some investigation we have found that the processes, who consume CPU sys time most of all, are hanging on "futex" system call:

      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
       74.31 13168.872885       68562    192072     38348 futex
       19.97 3539.132374     5998529       590       440 restart_syscall
        1.98  351.174121       24197     14513        18 recvfrom
        1.75  310.265366         865    358527      1466 write
        1.05  185.451445       59325      3126           poll
        0.18   31.986477        1823     17547      1006 read
        0.15   26.976110       33100       815           accept
        0.11   19.407311         982     19760      2760 stat
      

      In strace output for such processes we can see something like that:

       futex(0x7fc42800ba54, FUTEX_WAIT_BITSET_PRIVATE, 1, {5161981, 228340395}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
      futex(0x7fc42800ba28, FUTEX_WAKE_PRIVATE, 1) = 0
      futex(0x7fc42800ba54, FUTEX_WAIT_BITSET_PRIVATE, 1, {5161981, 278655215}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
      futex(0x7fc42800ba28, FUTEX_WAKE_PRIVATE, 1) = 0
      futex(0x7fc42800ba54, FUTEX_WAIT_BITSET_PRIVATE, 1, {5161981, 328848067}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
      futex(0x7fc42800ba28, FUTEX_WAKE_PRIVATE, 1) = 0
      futex(0x7fc42800ba54, FUTEX_WAIT_BITSET_PRIVATE, 1, {5161981, 379185329}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
      

      corresponding Java-process:

      "MVStore background writer nio:/u01/jenkins/jenkins_home/jenkins-jobs/jenkins-jobs.mv.db" #96324 daemon prio=5 os_prio=0 tid=0x00007fc3500a4000 nid=0x6048 in Object.wait() [0x
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      <------>at java.lang.Object.wait(Native Method)
      <------>at org.h2.mvstore.MVStore$BackgroundWriterThread.run(MVStore.java:2715)
      <------>- locked <0x00000001d78dfc88> (a java.lang.Object)
      
         Locked ownable synchronizers:
      <------>- None
      

      or that:

      clock_gettime(CLOCK_THREAD_CPUTIME_ID, {10, 798549532}) = 0
      stat("/var/cache/jenkins/war/WEB-INF/classes/hudson/maven/MavenModuleSet.class", 0x7f4168ac8540) = -1 ENOENT (No such file or directory)
      stat("/var/cache/jenkins/war/WEB-INF/classes/hudson/maven/MavenModuleSet.class", 0x7f4168ac8560) = -1 ENOENT (No such file or directory)
      lstat("/u01", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
      

      Environment: RHEL 7.5 with the latest updates (except of kernel). DRBD+ext4. OpenStack KVM. 32 CPU 64 GB RAM.

      JENKINS_JAVA_OPTIONS="-Djava.awt.headless=true -Dgroovy.use.classvalue=true -Xms32g -Xmx32g -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 -XX:InitiatingHeapOccupancyPercent=75 -XX:MaxMetaspaceSize=3024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/u01/jenkins -XX:+UnlockDiagnosticVMOptions -Djenkins.model.Jenkins.logStartupPerformance=true -Dhudson.plugins.git.GitStatus.allowNotifyCommitParameters=true -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Dhudson.model.DirectoryBrowserSupport.CSP=\"default-src 'self'; img-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; child-src 'self'; frame-src 'self';\" -Djenkins.model.DirectoryBrowserSupport.CSP=\"default-src 'self'; img-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; child-src 'self'; frame-src 'self';\" -Dorg.apache.commons.jelly.tags.fmt.timeZone=\"Europe/Moscow\" -Dhudson.model.User.SECURITY_243_FULL_DEFENSE=false -Dcom.cloudbees.workflow.rest.external.ChangeSetExt.resolveCommitAuthors=false -server -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -Xloggc:${JENKINS_HOME}/gc_logs/gc-%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=20m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.io.tmpdir=/u01/jenkins/tmp -Dhudson.remoting.ClassFilter=org.jenkinsci.lib.xtrigger.AbstractTrigger"
      
      root@jenkins-master-002:~$  uname -a
      Linux jenkins-master-002.vm.mos.cloud.sbrf.ru 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
      
      root@jenkins-master-002:~$  java -version
      java version "1.8.0_151"
      Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
      Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
      

      More diagnostic information in the attachments (we have created screenshots and ThreadDumps not at the same time, but at a time when we have seen the problem).

        Attachments

        1. 2.txt
          1.38 MB
        2. 3.txt
          1.49 MB
        3. 4.txt
          1.40 MB
        4. 5.txt
          1.44 MB
        5. fingerprint-add.png
          fingerprint-add.png
          121 kB
        6. image-2018-06-23-14-49-03-404.png
          image-2018-06-23-14-49-03-404.png
          319 kB
        7. perf-jenkins-master-002-16-41.PNG
          perf-jenkins-master-002-16-41.PNG
          61 kB
        8. strace-output.png
          strace-output.png
          96 kB

          Activity

          Hide
          capf Carsten Pfeiffer added a comment -

          So this is rather unrelated to the maven plugin and more of a Jenkins core issue. The maven plugin only brings up the issue because it creates fingerprints by default.

          Show
          capf Carsten Pfeiffer added a comment - So this is rather unrelated to the maven plugin and more of a Jenkins core issue. The maven plugin only brings up the issue because it creates fingerprints by default.
          Hide
          dkoryavov Denis Koryavov added a comment -

          We disabled automatic fingerprint archiving for all maven jobs and it worked. I thinks the priority can be changed to Critical, because we have a simply workaround.

          We'll try to use -Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true, but I think the problem should be fixed, because this behaviour of Jenkins is real issue for a large instances.

          Show
          dkoryavov Denis Koryavov added a comment - We disabled automatic fingerprint archiving for all maven jobs and it worked. I thinks the priority can be changed to Critical, because we have a simply workaround. We'll try to use -Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true, but I think the problem should be fixed, because this behaviour of Jenkins is real issue for a large instances.
          Hide
          capf Carsten Pfeiffer added a comment -

          One option would be to call FileChannel.force(false) instead of FileChannel.force(true) for the fingerprints in https://github.com/jenkinsci/jenkins/blame/jenkins-2.121.2/core/src/main/java/hudson/util/FileChannelWriter.java#L89 in order not to force-write all metadata. The manpage of fdatasync on Linux says: "The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.".
          Not sure how much this would help, though.

          Show
          capf Carsten Pfeiffer added a comment - One option would be to call FileChannel.force(false) instead of FileChannel.force(true) for the fingerprints in https://github.com/jenkinsci/jenkins/blame/jenkins-2.121.2/core/src/main/java/hudson/util/FileChannelWriter.java#L89 in order not to force-write all metadata. The manpage of fdatasync on Linux says: "The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.". Not sure how much this would help, though.
          Hide
          marco_jacob Marco Jacob added a comment -

          I'm not quite sure if this is a related problem:

          14 of 14 Build jobs in our Jenkins 2.107.3 are blocked in Fingerprint.add during collecting data from slaves.
          Both, master and two slaves running on Solaris.

           

          Show
          marco_jacob Marco Jacob added a comment - I'm not quite sure if this is a related problem: 14 of 14 Build jobs in our Jenkins 2.107.3 are blocked in Fingerprint.add during collecting data from slaves. Both, master and two slaves running on Solaris.  
          Hide
          akmjenkins ASHOK MOHANTY added a comment -

          We are using Jenkins v.2.121.3 (upgraded couple of months back from v.2.73.3 ). We have around ~1000 MBranch + ~200 pipeline + ~200FreeStyle projects.
          Post upgrade we are seeing CPU spike in the master server. We are not running any builds in master (primarily only MB indexings runs in master box). 

          Master Server - Linux 7.x (100+ cpu). We have gitLab* plugins for "SCM - GitLab" integration. GitLab hosted in separate Linux server. Can you please provide some tips to debug the issue !!

          Show
          akmjenkins ASHOK MOHANTY added a comment - We are using Jenkins v.2.121.3 (upgraded couple of months back from v.2.73.3 ). We have around ~1000 MBranch + ~200 pipeline + ~200FreeStyle projects. Post upgrade we are seeing CPU spike in the master server. We are not running any builds in master (primarily only MB indexings runs in master box).  Master Server - Linux 7.x (100+ cpu). We have gitLab* plugins for "SCM - GitLab" integration. GitLab hosted in separate Linux server. Can you please provide some tips to debug the issue !!

            People

            • Assignee:
              Unassigned
              Reporter:
              dkoryavov Denis Koryavov
            • Votes:
              5 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated: