Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-34448

Slave's offlineCause is rewritten during a rare condition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: core
    • Labels:
      None
    • Similar Issues:

      Description

      SlaveComputer._connect() can rewrite given offlineCause during a rare use-case - race condition.

      Connect and Disconnect of a slave is executed in a separate threads (due to non-block an UI operation). It is possible to connect slave and during this operation something asks to disconnect it. SlaveComputer._connect() expects if channel is null it means something wrong happened while launching although an exception hadn't thrown. It writes OfflineCause.LaunchFailed cause as a reason for that even it is already given for other reason (e.g. disconnect by CLI). The problematic part of the code is here. If channel is null, there is always stored LaunchFailed cause although the situation is the result of other action (disconnect, temporary offline etc.).

      I'm convinced this part of code should be executed only when we have empty offlineCause here (so nothing recognized a launch issue yet). Also all listeners in the case an offlineCause is given are notified already (closed channel, terminated channel...).

      I've investigated this issue during test coverage of disconnect-node CLI command when I've had a simple scenario:

      DumbSlave slave = j.createSlave("aNode", "", null);
      slave.toComputer().waitUntilOnline();
      assertThat(slave.toComputer().isOnline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause(), equalTo(null));
      
      CLICommandInvoker.Result result = command
          authorizedTo(Computer.DISCONNECT, Jenkins.READ)
          .invokeWithArgs("aNode");
      assertThat(slave.toComputer().isOffline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause() instanceof OfflineCause.ByCLI, equalTo(true));
      

      where the last assert failed randomly (from 10 to 40 percent of executions) for the reason that offlineCause wasn't a ByCLI but LaunchFailed although slave had been correctly started before and was able to execute a job.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                pajasoft Pavel Janoušek
                Reporter:
                pajasoft Pavel Janoušek
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: