Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: ssh-slaves-plugin
    • Labels:
      None
    • Environment:
      Jenkins 1.529
      OSX 10.8.4 (running as a VMWare Guest in VMWare Workstation 9.0.2 inside a Windows 7 Host)
      also Jenkins 1.645, OSX 10.9, 10.10 (not vm)
      also observed with Windows and Linux slaves.
    • Released As:
      ssh-slaves-1.31.1

      Description

      I configured an OSX slave to use an SSH connection. I have an identical setup for a Linux slave. The Linux slave never hangs, but the OSX one does randomly every couple of days.

      When the slave hangs, I see:

      This node is being launched. See log for more details
      

      When I click on more details I see an empty log (literally no characters) with a spinning wheel.

      I'd like to disconnect the channel and try again. Unfortunately, there is no "disconnect" button, seemingly because the hang occurs too early in the connection phase.

      The only way I found to fix this problem is to restart the Jenkins master. I believe this issue is high priority because:

      1. This hang occurs at least once a day (for over a week now).
      2. There is no known workaround.
      3. There is no way to recover except to restart the master node, which means that all running jobs have to be interrupted.

      If you can add extra logging, I can try collecting more information for you. Where do we get started?

            Activity

            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            The default settings on the connection timeout and retries should resolve this issue
            https://issues.jenkins-ci.org/browse/JENKINS-52739
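A minimal sketch of why a connection timeout combined with a bounded number of retries avoids an indefinite hang. This is not the ssh-slaves plugin's code; the function name, parameter names, and default values below are all hypothetical and only illustrate the general pattern.

```python
import socket
import time


def connect_with_retries(host, port, timeout_s=5.0, max_retries=3, wait_s=1.0):
    """Open a TCP connection, giving up after a bounded number of attempts.

    Returns the connected socket, or re-raises the last OSError.
    All names and defaults are illustrative assumptions.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            # The timeout bounds how long a single attempt may block,
            # so one unresponsive host cannot stall the caller forever.
            return socket.create_connection((host, port), timeout=timeout_s)
        except OSError as err:  # includes timeouts and refused connections
            last_error = err
            if attempt < max_retries:
                time.sleep(wait_s)  # short pause before the next attempt
    raise last_error
```

With this shape, the worst case is roughly `max_retries * (timeout_s + wait_s)` seconds before the caller gets an error it can report, instead of a launch that never finishes.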

            ifernandezcalvo Ivan Fernandez Calvo added a comment - - edited

            Overall recommendations:

            • Use JDK versions on the Jenkins instance and the agents that are as close as possible and within the same major version.
            • Tune the TCP stack on the Jenkins instance and the agents:
              On Linux: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
              On Windows: https://blogs.technet.microsoft.com/nettracer/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives/
              On Mac: https://www.gnugk.org/keepalive.html
            • Check for hs_err_pid error files in the root fs of the agent: http://www.oracle.com/technetwork/java/javase/felog-138657.html#gbwcy
            • Check the logs in the root fs of the agent.
            • Set the initial and maximum heap of the agent to at least 512M (-Xms512m -Xmx512m). You could start with 512m and lower the value until you find a proper value for your agents.
            • Disable energy-saving options that suspend or hibernate the host.
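To make the TCP keepalive recommendation more concrete, here is a small sketch that reads the system-wide keepalive settings on a Linux host and estimates how long an idle, dead connection can go unnoticed. It assumes the standard Linux files `tcp_keepalive_time`, `tcp_keepalive_intvl`, and `tcp_keepalive_probes` under `/proc/sys/net/ipv4`; the function names are hypothetical, and the estimate only applies to sockets that have keepalive enabled.

```python
from pathlib import Path

# Linux exposes the system-wide keepalive settings as files in this directory.
DEFAULT_PROC_DIR = "/proc/sys/net/ipv4"


def read_keepalive_settings(proc_dir=DEFAULT_PROC_DIR):
    """Return the keepalive idle time, probe interval (seconds), and probe count."""
    base = Path(proc_dir)
    return {
        "time": int((base / "tcp_keepalive_time").read_text().strip()),
        "intvl": int((base / "tcp_keepalive_intvl").read_text().strip()),
        "probes": int((base / "tcp_keepalive_probes").read_text().strip()),
    }


def seconds_until_dead_peer_detected(settings):
    """Rough upper bound: idle time before the first probe plus all failed probes."""
    return settings["time"] + settings["intvl"] * settings["probes"]
```

With the common Linux defaults (7200, 75, 9) this gives roughly 7875 seconds, a little over two hours, which is why lowering these values is often suggested for long-lived agent connections.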
            oleg_nenashev Oleg Nenashev added a comment -

            FYI Ivan Fernandez Calvo. I have never been able to diagnose this issue in detail after the last patches, but it seems there are more unfixed race conditions.

            I have no capacity to work on it anytime soon, so I will assign it and let others take it.

            ovidiub13 Ovidiu-Florin Bogdan added a comment - - edited

            The Support Core plugin gives empty logs for the slave in discussion.

            The slave node gets no connection attempt via ssh from the master. Getting the slave stack trace is not possible since the slave.jar is not being executed.

            I'm having no luck with the nsenter utility to enter and obtain the master stack trace. I need to restart the container holding master with --privileged to be able to get the stack trace. This would be rather tricky.

            P.S. Symlinking /dev/urandom to /dev/random on the slave has no effect. I realize now that I should've done this on the master.

            /dev/random on master has enough entropy, it works just fine.
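One way to check the entropy claim on a Linux master is to read the kernel's entropy estimate. This is a minimal sketch that assumes the standard Linux file `/proc/sys/kernel/random/entropy_avail` is available; the function name is hypothetical, and the sketch returns `None` on systems without that file.

```python
from pathlib import Path

# Standard location of the kernel's entropy estimate on Linux.
ENTROPY_FILE = "/proc/sys/kernel/random/entropy_avail"


def read_available_entropy(path=ENTROPY_FILE):
    """Return the kernel's entropy estimate in bits, or None if unavailable."""
    entropy_path = Path(path)
    if not entropy_path.exists():
        # Not a Linux host, or /proc is not mounted (e.g. a restricted container).
        return None
    return int(entropy_path.read_text().strip())
```

Sampling this value a few times while a launch is hanging would show whether a starved /dev/random is a plausible cause on that host.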

            oleg_nenashev Oleg Nenashev added a comment -

            Well, generally you need to dump stack traces somehow while the connection is hanging. https://forums.docker.com/t/how-to-dump-heap-from-a-java-program-running-in-container/3217 . Your mileage may vary.

            For master side you can also use https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin
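As a rough sketch of the stack trace step, the helper below builds a `jstack` command for a given JVM process ID, optionally wrapped in `docker exec` when the master runs in a container. It assumes a JDK with `jstack` on the PATH where the JVM runs and sufficient permissions to attach to the process; the function name and the container name in the usage note are hypothetical.

```python
def thread_dump_command(pid, container=None):
    """Build the argument list for dumping the threads of a running JVM.

    pid:       process ID of the JVM, as seen where the command will run
    container: optional Docker container name or ID (assumption: the
               JDK tools are installed inside that container)
    """
    cmd = ["jstack", str(pid)]
    if container is not None:
        # Run jstack inside the container that hosts the JVM.
        cmd = ["docker", "exec", container] + cmd
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` while the launch is hanging, for example with `thread_dump_command(1234, container="jenkins-master")`, and the output attached to the issue.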


              People

              • Assignee:
                ifernandezcalvo Ivan Fernandez Calvo
                Reporter:
                cowwoc cowwoc
              • Votes:
                12
                Watchers:
                23
