JENKINS-39179: All builds hang, JNA load deadlock on Windows slave


      Description

      I hate to create a general "core" bug, and I wish I could direct this to the correct component. Unfortunately, I cannot identify which component is hanging or why, so I do not know where to file this problem.

      This problem started about 2 weeks ago, as we have been adding new Pipeline builds to our build server. So it could be related to one of the pipeline plugins.

      The behavior is the following:

      • Once or twice a day, all builds on all build slaves hang. The console log of each build stops advancing and stays stuck at the last line executed / last line returned.
      • Once this occurs, attempting to stop a build fails. Clicking stop results in no change in the build status or console log output.
      • New builds will not start. They sit in the queue, but the slaves are not started.
      • The UI continues to function, so it is possible to view config, get thread dumps, etc.

      The only resolution is to restart the Jenkins server.

      We are using the vCenter plugin to dynamically start all build slaves. However, we have been using this configuration for months, and the problem only started recently.

      We have reproduced this on both the latest Jenkins release (2.26) and Jenkins LTS 2.19.1.

      I am attaching a threaddump of the server at the time of one of these hangs.
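
For readers who want to capture the same kind of evidence programmatically, here is a minimal sketch of collecting a thread dump from inside a JVM with the standard `java.lang.management` API. The class name `ThreadDumpSketch` is hypothetical, and this is not how the attached dump was produced; it only illustrates the kind of data a thread dump contains.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Jenkins: renders every live thread's
// name, state, and stack trace as plain text.
final class ThreadDumpSketch {
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only request lock details if this JVM supports reporting them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Calling `ThreadDumpSketch.dump()` while builds are hung and comparing the stack traces of the stuck threads is one way to see where they are all waiting.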

      I can provide any other information that might help in diagnosing this problem.

            Activity

            pjdarton pjdarton added a comment -

            I agree.
            In my experience, "isSymlink" is called a lot on Windows, especially when deleting things from disk.
            I'd also guess that "isSymlink" usage drowns-out all other JNA usage.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Devin Nusbaum
            Path:
            core/src/main/java/hudson/Util.java
            core/src/main/java/hudson/util/jna/Kernel32Utils.java
            core/src/test/java/hudson/FilePathTest.java
            core/src/test/java/hudson/UtilTest.java
            http://jenkins-ci.org/commit/jenkins/52fa4d90b938243ccc273955caa7262154b9f688
            Log:
            JENKINS-39179 JENKINS-36088 Always use NIO to create and detect symbolic links and Windows junctions (#3133)

            • Always use NIO to detect symlinks
            • Make assertion failure message consistent
            • Catch NoSuchFileException to keep tests passing
            • Make method name more specific and simplify assumption
            • Remove obsolete comment and reword the main comment in isSymlink
            • Deprecate Kernel32Utils#isJunctionOrSymlink
            • Use assumptions for junction creation and add messages to assumptions
            • Replace deprecated code with recommended alternative
            • Add comment explaining call to DosFileAttributes#isOther
            • Do not fall back to native code when creating symlinks
            • Log FileSystemExceptions when creating symbolic links
            • Catch InvalidPathException and rethrow as IOException
            • Deprecate Kernel32Utils#createSymbolicLink and #getWin32FileAttributes
            • Preserve original logging behavior on Windows and remove useless call to Util#displayIOException
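
As an illustration of the NIO-only approach named in the commit log above, here is a minimal sketch of detecting symbolic links, and Windows junctions via `DosFileAttributes#isOther`, without any native (JNA) call. The class and method names are hypothetical and this is not the code from the commit; treating `isOther()` as "junction" on Windows is an assumption based on the commit's mention of that call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.DosFileAttributes;

// Hypothetical helper, not the actual hudson.Util implementation.
final class SymlinkCheck {
    static boolean isSymlinkOrJunction(Path path) throws IOException {
        try {
            // NOFOLLOW_LINKS: inspect the link itself, not its target.
            BasicFileAttributes attrs = Files.readAttributes(
                    path, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
            if (attrs.isSymbolicLink()) {
                return true;
            }
            // Assumption: on Windows the attributes are DosFileAttributes and
            // a junction shows up as "other" (neither file, directory, nor symlink).
            return attrs instanceof DosFileAttributes && attrs.isOther();
        } catch (NoSuchFileException e) {
            // A missing path is simply "not a link".
            return false;
        }
    }
}
```

Because this path goes through `java.nio.file` only, it never triggers loading of `com.sun.jna.Native`, which is the load the issue title points at.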
            jglick Jesse Glick added a comment -

            I attached a build of an experimental plugin to this page; sources on GitHub: avoid-agent-jna-deadlock-plugin. It may work around the problem, and more easily than the previous workaround of configuring -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native on every agent (since you need merely install the plugin for the workaround to take effect). Without knowing how to reproduce the problem from scratch, I cannot confirm that it helps.
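
To make the idea of preloading concrete, here is a minimal sketch of forcing a class to load and initialize eagerly, before several threads can race to load it. This is not the plugin's source; `Preload` and its `preload` method are hypothetical names, and the class to load is passed in as a string so the sketch does not require JNA on the classpath.

```java
// Hypothetical helper, not taken from avoid-agent-jna-deadlock-plugin.
final class Preload {
    /**
     * Loads and initializes the named class up front.
     * Returns false if the class is absent or fails to link/initialize.
     */
    static boolean preload(String className) {
        try {
            // initialize=true runs static initializers now, on this thread.
            Class.forName(className, true, Preload.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

On an agent, the equivalent call would be something like `Preload.preload("com.sun.jna.Native")` early in startup, under the assumption that JNA is visible to that class loader.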

            The JNA fix is as yet unreleased—scheduled for JNA 5.0.0 (due to its introducing an incompatible API change). Jenkins still uses 4.2.1. Updating to the current release 4.5.0 would not help in this regard, and I am loath to begin using an unreleased custom build or fork.

            The direction we would like to take is to simply avoid using JNA at all from core, unless there is no plausible alternative. That has already been done in the case mentioned here, that of FilePath.deleteRecursive. See also workflow-support PR 48 which may help.
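
For illustration, here is a minimal sketch of a recursive delete that relies only on `java.nio.file` and therefore needs no JNA call. It is not the `FilePath.deleteRecursive` implementation; `NioDelete` is a hypothetical name. Symbolic links are removed as links rather than followed, and the sketch makes no claim about Windows junctions.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not the Jenkins implementation.
final class NioDelete {
    static void deleteRecursive(Path root) throws IOException {
        if (Files.notExists(root, LinkOption.NOFOLLOW_LINKS)) {
            return; // nothing to delete
        }
        // walkFileTree does not follow symbolic links unless asked to,
        // so a link is handed to visitFile and only the link is deleted.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // propagate a failure from listing the directory
                }
                Files.delete(dir); // directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```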

            heldermagalhaes Helder Magalhães added a comment -

            @Jesse Glick: I've verified that the plug-in works properly for Windows slaves. Unfortunately we have a mixed installation base of Linux slaves as well, which break when "Launch slave agents via SSH" option is used:

             

            <===[JENKINS REMOTING CAPACITY]===>channel started
            Slave.jar version: 2.53.2
            This is a Unix slave
            Preloading JNA to avoid JENKINS-39179
            Slave JVM has not reported exit code. Is it still running?
            [04/23/18 08:29:08] Launch failed - cleaning up connection
            [04/23/18 08:29:08] [SSH] Connection closed.
            ERROR: Connection terminated
            

            I'm attaching MyLinuxSlave-SystemInformation.txt. Could the problem be related to using a rather old (1.7) Java version?

            Although it doesn't work (yet), thanks for the effort! I really prefer this to be the way (instead of changing configuration in all Windows nodes) until an official fix is provided.

             

            pjdarton pjdarton added a comment -

            Helder Magalhães: You should be using Java 8 (aka 1.8) on both the master and slaves. Support for 1.7 ceased last year. See https://jenkins.io/blog/2017/04/10/jenkins-has-upgraded-to-java-8/

            If you're using (very) different Java versions on the master and slaves then you can get weird errors.


              People

              • Assignee: Unassigned
              • Reporter: gregcovertsmith Greg Smith
              • Votes: 3
              • Watchers: 16
