  1. Jenkins
  2. JENKINS-58573

100% CPU remoting.jar or slave.jar on EC2 (connection refused)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Jenkins 2.176.2
      ec2 1.44.1
      Description

      Jenkins EC2 nodes are constantly crashing with 100% CPU usage:

       

      java.util.concurrent.TimeoutException: Ping started at 1563548484233 hasn't completed by 1563548724233
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)
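The two numbers in the exception message are epoch milliseconds. As a small worked example (not part of the original report; the function names are illustrative), the sketch below converts them to UTC and computes the window the ping was given. The difference is 240 seconds, which suggests the agent was unresponsive for minutes rather than briefly busy.

```python
from datetime import datetime, timezone

# Epoch-millisecond values copied from the TimeoutException message above.
PING_STARTED_MS = 1563548484233
PING_DEADLINE_MS = 1563548724233


def to_utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def ping_window_seconds(started_ms: int, deadline_ms: int) -> float:
    """Length of the window the ping was given to complete, in seconds."""
    return (deadline_ms - started_ms) / 1000


if __name__ == "__main__":
    print("ping started :", to_utc(PING_STARTED_MS).isoformat())
    print("ping deadline:", to_utc(PING_DEADLINE_MS).isoformat())
    print("window (s)   :", ping_window_seconds(PING_STARTED_MS, PING_DEADLINE_MS))
```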

       

      I tried connecting via both "native ssh" and jenkins-ssh, and both show the same issue. It looks like remoting.jar is hung:

       

       

      JvmTop 0.8.0 alpha - 14:58:59, amd64, 4 cpus, Linux 4.9.0-9-a, load avg 7.89
       http://code.google.com/p/jvmtop

        PID MAIN-CLASS       HPCUR HPMAX NHCUR NHMAX   CPU    GC    VM USERNAME  #T DL
       4093 m.jvmtop.JvmTop    21m 1698m   18m   n/a 0.50% 0.00% O8U21 root      12
       3973 remoting.jar     [ERROR: Connection refused/access denied]
       6406 remoting.jar     [ERROR: Connection refused/access denied]
      

       

       

      Not sure how to further debug this.
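One common way to dig further, sketched here under assumptions, is to find the busiest native thread with `top -H -p <pid>` and look it up in a thread dump taken with `jstack <pid>` or `jcmd <pid> Thread.print`. Thread dumps print the native thread id in hexadecimal as `nid=0x...`, while `top` shows it in decimal. The helper below does that lookup; the function name and the way the dump is passed in are illustrative, not part of any Jenkins tooling.

```python
import re
import sys
from typing import Optional

# Matches the header line of one thread in a thread dump, for example:
#   "Ping thread for channel" #23 daemon prio=5 ... nid=0x1a2b runnable
THREAD_HEADER = re.compile(r'^"(?P<name>.*)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def find_thread_by_tid(dump_text: str, tid: int) -> Optional[str]:
    """Return the name of the Java thread whose native id equals `tid`.

    `tid` is the decimal thread id shown by `top -H -p <pid>`;
    the dump prints the same id in hexadecimal as `nid=0x...`.
    """
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match and int(match.group("nid"), 16) == tid:
            return match.group("name")
    return None


if __name__ == "__main__":
    # Usage (hypothetical file name): python find_thread.py dump.txt 6699
    dump_path, tid_arg = sys.argv[1], int(sys.argv[2])
    with open(dump_path, encoding="utf-8") as handle:
        print(find_thread_by_tid(handle.read(), tid_arg))
```

The dump tools generally need to run as the same user as the target JVM. That may also be why JvmTop, which appears to be running as root above, reports "Connection refused/access denied" for the remoting.jar processes, although this is a guess and not confirmed in the report.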

        Attachments

          Activity

          raihaan Raihaan Shouhell added a comment -

          What version of Java? I'm not sure whether anyone from remoting can chime in on what makes the process spike to 100%.

          lifeofguenter Gunter Grodotzki added a comment -
          $ java -version
          openjdk version "1.8.0_222"
          OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10)
          OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

          Even on a relatively strong machine (EC2 c5.2xlarge) it is happening. I am only connecting via internal IPs, so there is no firewall in between.

          lifeofguenter Gunter Grodotzki added a comment -

          Ruling out `ec2-plugin` as a possible culprit: I launched an EC2 instance (same config) and manually attached it as a permanent node. Eventually it started crashing again.

           

          I am seeing the following in the logs:

           

          jenkins-slave.2.log:2019-07-23T13:16:10.220+0000 WARNING hudson.Proc$LocalProc join: Process leaked file descriptors. See https://jenkins.io/redirect/troubleshooting/process-leaked-file-descriptors for more information
          jenkins-slave.2.log:2019-07-23T13:28:54.652+0000 WARNING hudson.Proc$LocalProc join: Process leaked file descriptors. See https://jenkins.io/redirect/troubleshooting/process-leaked-file-descriptors for more information
          
          

          But while that is not ideal, it should not cause remoting.jar to crash completely?

           

          I will try a slightly different setup with Debian 10 and OpenJDK 11 to eliminate OS issues.

           

          lifeofguenter Gunter Grodotzki added a comment (edited) -

          Switched over to Ubuntu 18.04 with OpenJDK 11. I still need to run longer tests. It seems a bit better, but every now and then remoting.jar spikes to 100% CPU and drops again, without any builds running. So the load never goes down to 0, even though nothing is building.

          I am curious whether this has something to do with the jenkins-master running behind Cloudflare. But the nodes connect to the master via the internal IP, so this should not be an issue?

           

          Update: after some time the load still gradually increases. It also seems to affect the master in the same way.

          thoulen FABRIZIO MANFREDI added a comment -

          Can you make a flight recording, or a memory dump of the master, so we can check the status of the master?

          Are the master and the slave running the same JDK?

          Does it happen only with EC2?
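For reference, the JDK's `jcmd` tool can produce thread dumps, heap dumps and flight recordings. The sketch below only builds the command lines, so they can be reviewed before being run. The helper names and output paths are illustrative assumptions. `JFR.start` works only on JDK builds that ship Flight Recorder, which as far as I know includes OpenJDK 11 but not the 8u222 build mentioned earlier in this thread.

```python
import shlex
from typing import List


def thread_dump_cmd(pid: int) -> List[str]:
    """Print all Java thread stacks of the target JVM."""
    return ["jcmd", str(pid), "Thread.print"]


def heap_dump_cmd(pid: int, out_file: str) -> List[str]:
    """Write a heap dump (.hprof) of the target JVM to out_file."""
    return ["jcmd", str(pid), "GC.heap_dump", out_file]


def flight_recording_cmd(pid: int, out_file: str, seconds: int = 60) -> List[str]:
    """Start a time-limited Java Flight Recorder recording."""
    return ["jcmd", str(pid), "JFR.start",
            f"duration={seconds}s", f"filename={out_file}"]


if __name__ == "__main__":
    pid = 3973  # one of the remoting.jar PIDs from the JvmTop output in the description
    commands = (
        thread_dump_cmd(pid),
        heap_dump_cmd(pid, "/tmp/agent.hprof"),      # hypothetical output path
        flight_recording_cmd(pid, "/tmp/agent.jfr"),  # hypothetical output path
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

`jcmd` ships with the JDK rather than the JRE and normally has to be run as the same user as the target JVM. Use absolute output paths, since relative paths may be resolved against the target process's working directory.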

          lifeofguenter Gunter Grodotzki added a comment -

          If you can point me to documentation on how to do this, that would be great.

           

          I am running the master from the Docker image jenkins/jenkins:lts-slim, but I am guessing the CPU issues on the master are only a symptom. I haven't tried it with anything other than EC2. But given that the issue occurs on both Ubuntu and Debian AMIs, it's probably not EC2.

           

          It would be really great if remoting.jar had better support for New Relic, so I could see exactly what is consuming 100% CPU. That would make it easy to figure out what's going on. Right now there is no helpful data.


            People

            • Assignee: thoulen FABRIZIO MANFREDI
            • Reporter: lifeofguenter Gunter Grodotzki
            • Votes: 0
            • Watchers: 5
