Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-34448

Slave's offlineCause is rewritten during a rare condition

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • core
    • None

      SlaveComputer._connect() can rewrite given offlineCause during a rare use-case - race condition.

      Connect and Disconnect of a slave is executed in a separate threads (due to non-block an UI operation). It is possible to connect slave and during this operation something asks to disconnect it. SlaveComputer._connect() expects if channel is null it means something wrong happened while launching although an exception hadn't thrown. It writes OfflineCause.LaunchFailed cause as a reason for that even it is already given for other reason (e.g. disconnect by CLI). The problematic part of the code is here. If channel is null, there is always stored LaunchFailed cause although the situation is the result of other action (disconnect, temporary offline etc.).

      I'm convinced this part of code should be executed only when we have empty offlineCause here (so nothing recognized a launch issue yet). Also all listeners in the case an offlineCause is given are notified already (closed channel, terminated channel...).

      I've investigated this issue during test coverage of disconnect-node CLI command when I've had a simple scenario:

      DumbSlave slave = j.createSlave("aNode", "", null);
      slave.toComputer().waitUntilOnline();
      assertThat(slave.toComputer().isOnline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause(), equalTo(null));
      
      CLICommandInvoker.Result result = command
          authorizedTo(Computer.DISCONNECT, Jenkins.READ)
          .invokeWithArgs("aNode");
      assertThat(slave.toComputer().isOffline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause() instanceof OfflineCause.ByCLI, equalTo(true));
      

      where the last assert failed randomly (from 10 to 40 percent of executions) for the reason that offlineCause wasn't a ByCLI but LaunchFailed although slave had been correctly started before and was able to execute a job.

            pajasoft Pavel Janoušek
            pajasoft Pavel Janoušek
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: