JENKINS-58492

Improve robustness w.r.t. bad slave nodes

      Description

      At present (version 1.0.5), the overall stability and performance of every build is only as good as the worst slave.
      In an ideal world this isn't a problem, but in the real world, where not everything is perfect, we need to minimize the impact of a bad slave rather than maximize it. Fortunately, at least for this plugin, that shouldn't be too difficult to achieve.

      This plugin should be more robust in how it deals with the other slaves:

      1. Timeouts.
        Surround all calls to the remote slaves with timeouts so that we can ensure that the cleanup stage cannot run indefinitely.
      2. Parallel execution.
        Run each remote deletion in a separate thread so that deletions on different slaves can happen in parallel.
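
A minimal sketch of how the two proposals above could combine, using only the standard `java.util.concurrent` API. It is not the plugin's actual code: `ParallelCleanupSketch`, `NodeCleanup` and `cleanAll` are hypothetical names, and `NodeCleanup` merely stands in for whatever remote call deletes a workspace on one node.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not taken from the plugin. */
final class ParallelCleanupSketch {

    /** Stand-in (assumption) for "delete the workspace on one node"; may block or fail. */
    interface NodeCleanup {
        void deleteWorkspace() throws Exception;
    }

    /**
     * Starts every deletion at once and waits at most timeoutMillis for all of them.
     * Returns, per node name, true only if that deletion finished normally in time.
     */
    static Map<String, Boolean> cleanAll(Map<String, NodeCleanup> nodes, long timeoutMillis) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        if (nodes.isEmpty()) {
            return results;
        }
        List<String> names = new ArrayList<>(nodes.keySet());
        List<Callable<Void>> tasks = new ArrayList<>();
        for (String name : names) {
            NodeCleanup cleanup = nodes.get(name);
            tasks.add(() -> {
                cleanup.deleteWorkspace();
                return null;
            });
        }
        // One daemon thread per node, so a slow or hung node cannot delay the others.
        ExecutorService pool = Executors.newFixedThreadPool(names.size(), runnable -> {
            Thread thread = new Thread(runnable, "workspace-cleanup");
            thread.setDaemon(true);
            return thread;
        });
        try {
            // invokeAll returns once all tasks are done or the timeout expires;
            // tasks still running at that point are cancelled (interrupted).
            List<Future<Void>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                results.put(names.get(i), finishedNormally(futures.get(i)));
            }
        } catch (InterruptedException e) {
            // The calling thread was interrupted: restore the flag, report nothing as cleaned.
            Thread.currentThread().interrupt();
            for (String name : names) {
                results.putIfAbsent(name, false);
            }
        } finally {
            pool.shutdownNow();
        }
        return results;
    }

    private static boolean finishedNormally(Future<Void> future) {
        if (future.isCancelled()) {
            return false; // timed out
        }
        try {
            future.get(); // already complete, so this does not block
            return true;
        } catch (Exception e) {
            return false; // the deletion itself threw
        }
    }
}
```

Note that this only bounds how long the calling build waits. A remote call that ignores interruption would stay stuck on its background thread, which is why the sketch uses daemon threads.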

      Justification:
      Scenario one:
      If a slave has locked up or is otherwise unresponsive (something we find happens, especially with Windows based slaves) then all builds (that might run on that slave) will end up locking up when they attempt to remove their workspace from that slave.
      If we had timeouts then, while we can't rescue the build that's running on the locked-up slave, at least all our other builds will continue unaffected, minimizing the impact of that badly-behaved slave node.
      Scenario two:
      When there are a lot of slaves, deleting each workspace in sequence can take a long time, causing big delays for the builds; the workspace cleanup phase of a build can be significantly longer than all of the rest of the build activity combined.
      If we ran each deletion in parallel then all the slaves could delete their workspaces in parallel, ensuring that the overall delay to the currently-running build was only as long as the slowest slave.

      Note: We could make this parallel/serial choice configurable, and we could make the timeout configurable too, with the default for existing configurations being "serial, no timeout" to preserve existing behavior. The Jelly code could set the defaults for new users to be "parallel, 5 minutes" or similar.
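
One way the defaults described in this note could be modelled is sketched below. Everything here is an assumption made for illustration: the class `CleanupSettingsSketch`, its method names, and the convention that a missing value (`null`) represents a setting absent from an older saved configuration and that a timeout of 0 means "no timeout".

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical settings holder for illustration only; not taken from the plugin. */
final class CleanupSettingsSketch {
    private final boolean parallel;
    private final long timeoutMinutes; // assumption: 0 means "no timeout"

    /** A null argument models a value that an older saved configuration never had. */
    CleanupSettingsSketch(Boolean parallel, Long timeoutMinutes) {
        // Missing values fall back to "serial, no timeout" to preserve existing behavior.
        this.parallel = parallel != null && parallel;
        this.timeoutMinutes = timeoutMinutes == null ? 0L : Math.max(0L, timeoutMinutes);
    }

    /** Defaults a configuration form might offer to new users: "parallel, 5 minutes". */
    static CleanupSettingsSketch newUserDefaults() {
        return new CleanupSettingsSketch(true, 5L);
    }

    boolean isParallel() {
        return parallel;
    }

    boolean hasTimeout() {
        return timeoutMinutes > 0;
    }

    long timeoutMillis() {
        return TimeUnit.MINUTES.toMillis(timeoutMinutes);
    }
}
```

With this shape, an existing configuration keeps behaving exactly as before unless someone opts in, while a newly created one starts with the safer settings.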

      TL;DR: Ensure that "this build on this slave" is unaffected by problems with "other slaves that this build could've used".

            Activity

            pjdarton pjdarton added a comment -

            I've created a PR that'll address this issue ... but until the plugin code has a Jenkinsfile (see PR#5) folks will have to build their own copy of the code.

            pjdarton pjdarton added a comment -

            Update for anyone watching this:
            There's now a Pull Request that contains a fix for this issue (plus other enhancements). Anyone can download the .hpi file of the bugfixed plugin from there and then upload it (Manage Jenkins -> Manage Plugins -> Advanced -> Upload plugin) to their own Jenkins server(s) to try it out.

            Once I'm confident that everything is OK then I'll merge those changes in and release the new plugin officially.

            pjdarton pjdarton added a comment -

            Fixed in version 1.0.6, which was released today.


              People

              • Assignee:
                pjdarton pjdarton
                Reporter:
                pjdarton pjdarton