JENKINS-58492

Improve robustness w.r.t. bad slave nodes

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Component/s: hudson-wsclean-plugin
    • Environment: Master with multiple static slaves

      At present (version 1.0.5), the overall stability and performance of every build is only as good as that of the worst slave.
      In an ideal world this isn't a problem, but "in the real world", where not everything is perfect, we need to minimize the impact of a bad slave rather than maximize it. Fortunately, at least for this plugin, that shouldn't be too difficult to achieve.

      This plugin should be more robust in its dealings with the other slaves:

      1. Timeouts.
        Surround all calls to the remote slaves with timeouts so that we can ensure that the cleanup stage cannot run indefinitely.
      2. Parallel execution.
        Run each remote deletion in a separate thread so that deletions on different slaves can happen in parallel.
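
The two proposals above can be sketched together in plain Java. This is only an illustration under stated assumptions, not the plugin's code: `ParallelCleanupSketch`, `Deletion`, `Outcome`, and `cleanAll` are hypothetical names, and `Deletion` stands in for whatever remote call actually deletes a workspace. The sketch leans on the standard `ExecutorService.invokeAll` overload that takes a timeout, which cancels any task that has not finished when the timeout expires.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names come from the plugin. */
final class ParallelCleanupSketch {

    /** Stand-in for "delete this build's workspace on one remote node". */
    interface Deletion {
        void run() throws Exception;
    }

    enum Outcome { DELETED, FAILED, TIMED_OUT }

    /**
     * Runs every deletion on its own thread and waits at most
     * timeoutMillis in total before giving up on the stragglers.
     */
    static Map<String, Outcome> cleanAll(Map<String, Deletion> deletionsByNode,
                                         long timeoutMillis) {
        Map<String, Outcome> outcomes = new LinkedHashMap<>();
        if (deletionsByNode.isEmpty()) {
            return outcomes;
        }
        List<String> names = new ArrayList<>(deletionsByNode.keySet());
        List<Callable<Void>> tasks = new ArrayList<>();
        for (String name : names) {
            Deletion deletion = deletionsByNode.get(name);
            tasks.add(() -> { deletion.run(); return null; });
        }
        // One thread per node, so no deletion waits behind another.
        ExecutorService pool = Executors.newFixedThreadPool(names.size());
        try {
            // Returns when all tasks are done or the timeout expires;
            // tasks still running at that point are cancelled.
            List<Future<Void>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                outcomes.put(names.get(i), outcomeOf(futures.get(i)));
            }
        } catch (InterruptedException e) {
            // The calling build itself was interrupted; report what we know.
            Thread.currentThread().interrupt();
            for (String name : names) {
                outcomes.putIfAbsent(name, Outcome.FAILED);
            }
        } finally {
            pool.shutdownNow(); // interrupt anything still running
        }
        return outcomes;
    }

    private static Outcome outcomeOf(Future<Void> future) {
        if (future.isCancelled()) {
            return Outcome.TIMED_OUT;
        }
        try {
            future.get(); // already complete, so this does not block
            return Outcome.DELETED;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }
}
```

With this shape, a node that never answers is reported as `TIMED_OUT` while the calling build moves on. One caveat: cancellation only interrupts the worker thread, so a remote call that ignores interruption would leave that thread lingering in the background even though the build is no longer waiting on it.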

      Justification:
      Scenario one:
      If a slave has locked up or is otherwise unresponsive (something we find happens, especially with Windows-based slaves) then all builds that might run on that slave will end up locking up when they attempt to remove their workspace from that slave.
      If we had timeouts then, while we can't rescue the build that's running on the locked-up slave, at least all our other builds will continue unaffected, minimizing the impact of that badly-behaved slave node.
      Scenario two:
      When there are a lot of slaves, deleting each workspace in sequence can take a long time, causing big delays for the builds; the workspace cleanup phase of a build can take significantly longer than all of the rest of the build activity combined.
      If we ran the deletions in parallel then all the slaves could delete their workspaces at the same time, ensuring that the overall delay to the currently-running build was only as long as the slowest slave's deletion.
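
To put numbers on scenario two: in sequence the build waits for the sum of the per-slave deletion times, while in parallel it waits only for the largest one. The sketch below shows that arithmetic; the class name, method names, and timings are made up for illustration and are not measurements.

```java
import java.util.Arrays;

/** Hypothetical helper comparing the two cleanup strategies. */
final class CleanupDelaySketch {

    /** Serial cleanup: each deletion starts after the previous one ends. */
    static long serialDelaySeconds(long[] perSlaveSeconds) {
        return Arrays.stream(perSlaveSeconds).sum();
    }

    /** Parallel cleanup: the build waits only for the slowest deletion. */
    static long parallelDelaySeconds(long[] perSlaveSeconds) {
        return Arrays.stream(perSlaveSeconds).max().orElse(0L);
    }
}
```

For three slaves whose deletions take 30, 45, and 240 seconds (invented figures), the serial delay is 315 seconds and the parallel delay is 240 seconds. The gap widens as more slaves are added, because each extra slave adds to the sum but only raises the maximum if it is the new slowest.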

      Note: We could make this parallel/serial choice configurable, and we could make the timeout configurable too, with the default for existing configurations being "serial, no timeout" to preserve existing behavior. The Jelly code could set the defaults for new users to be "parallel, 5 minutes" or similar.
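
One way the defaults described in the note could be resolved is sketched below. Everything here is hypothetical: `CleanupSettingsSketch` and its members are illustrative names, and the sketch assumes that a setting missing from an already-saved configuration arrives as `null`, which is why boxed types are used for the inputs.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical settings holder; names are illustrative, not the plugin's. */
final class CleanupSettingsSketch {
    final boolean parallel;
    final long timeoutMillis; // 0 means "no timeout"

    private CleanupSettingsSketch(boolean parallel, long timeoutMillis) {
        this.parallel = parallel;
        this.timeoutMillis = timeoutMillis;
    }

    /** Defaults offered to newly created configurations: parallel, 5 minutes. */
    static CleanupSettingsSketch newJobDefaults() {
        return new CleanupSettingsSketch(true, TimeUnit.MINUTES.toMillis(5));
    }

    /**
     * Resolves values loaded from a saved configuration. A null argument is
     * assumed to mean "saved before this option existed", and falls back to
     * the old behavior: serial, no timeout.
     */
    static CleanupSettingsSketch fromSaved(Boolean parallel, Long timeoutMinutes) {
        boolean resolvedParallel = parallel != null && parallel;
        long resolvedTimeout = (timeoutMinutes == null || timeoutMinutes <= 0)
                ? 0L
                : TimeUnit.MINUTES.toMillis(timeoutMinutes);
        return new CleanupSettingsSketch(resolvedParallel, resolvedTimeout);
    }

    boolean hasTimeout() {
        return timeoutMillis > 0;
    }
}
```

Under these assumptions, `fromSaved(null, null)` yields serial cleanup with no timeout, so an upgraded installation behaves as before until someone changes the setting, while `newJobDefaults()` gives new users the safer behavior from the start.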

      TL;DR: Ensure that "this build on this slave" is unaffected by problems with "other slaves that this build could've used".

            Assignee: pjdarton
            Reporter: pjdarton