  1. Jenkins
  2. JENKINS-50504

Jenkins is handing out workspaces that are already in use to new jobs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: core
    • Environment:
    • Similar Issues:
    • Released As:
      2.173

      Description

      tl;dr When the master's connection to an SSH slave times out and a new connection is opened, references to the old Remoting channel still persist in the job and the workspace list. This means that in-use workspaces get handed out to new jobs (because the logic that checks for a workspace being in use doesn't take the reconnect case into account), and then both jobs clobber each other and fail.

      I have a Jenkins installation where I am using the SSH slaves plugin to run long-running jobs on two nodes. Every so often, due to networking issues, the nodes lose their connection to the master. When this happens, an I/O error occurs on the Remoting channel, and the following output is printed in the Jenkins console log:

      SEVERE: I/O error in channel jenkins-node
      INFO: Attempting to reconnect jenkins-node
      [03/16/18 12:50:29] SSH Launch of jenkins-node on jenkins-node.example.com completed in 25,604 ms
      

      As you can see, the node gets disconnected, and then Jenkins reconnects to it. Looking at the logs for the node in the Manage Nodes view, I can see that a brand new SSH connection was opened:

      [03/31/18 10:08:57] [SSH] Opening SSH connection to jenkins-node.example.com:22.
      [03/31/18 10:08:58] [SSH] Authentication successful.
      [03/31/18 10:08:58] [SSH] The remote user's environment is:
      [...]
      

      The timestamps reflect the current time, not the time the Jenkins master was originally launched.

      So far, so good. Now, bear in mind that I have several long-running jobs using this node, and these jobs keep running after the node disconnects and reconnects. The node never went down; its network was just temporarily unavailable. The jobs keep running on it, and eventually (when the network comes back up) their results make it back to the master under the new Remoting channel.

      At this point, new jobs start (against the newly reconnected node), and here is where the trouble starts. The new jobs get allocated a workspace that is already being used by a currently-running job from before the node disconnect/reconnect. This results in disaster for both jobs, as the new job will do a Git clone and clobber files in the workspace of the already running job. Both jobs will then fail. By the time users notice and contact me, dozens of jobs have failed.

      * * *

      I started looking into why this happens, and I believe I understand the cause. When Jenkins allocates a workspace, it starts by calling Computer#getWorkspaceList to get the WorkspaceList for the node in question. That class keeps track of the in-use workspaces in the following map:

      private final Map<FilePath,Entry> inUse = new HashMap<FilePath,Entry>();
      

      When a new job runs, a FilePath is constructed for the desired workspace path. The inUse map is then checked to see whether that workspace has already been handed out to a running job. That lookup relies on FilePath#equals:

      @Override
      public boolean equals(Object o) {
          if (this == o) return true;
          if (o == null || getClass() != o.getClass()) return false;
      
          FilePath that = (FilePath) o;
      
          if (channel != null ? !channel.equals(that.channel) : that.channel != null) return false;
          return remote.equals(that.remote);
      }
      

      The problem here is that the candidate FilePath has the new Remoting channel (because the node disconnected and reconnected), while the entry in inUse has the old channel (because the old channel was still active when that job started running). As a result, the equality check fails. Jenkins then hands out the workspace to the new job, even though the old job is still using it. Then both jobs clobber each other and fail. I verified that this was happening on my instance by running the following script in the Script Console:

      import jenkins.model.Jenkins
      
      def computer = null
      Jenkins.instance.computers.each {
        if (it.name == 'blackbox-slave2') {
          computer = it
        }
      }
      
      println computer.channel
      
      computer.workspaceList.inUse.each { key, value ->
        if (key.channel != computer.channel) {
          println "'In use' under an old channel (and therefore not really considered in use): " + key
        }
      }
      
      println 'done'
      

      This printed several workspaces that were in the "in use" map under an old channel, and therefore not considered in use anymore from the perspective of a new job.
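The lookup miss described above can be reproduced outside Jenkins with a small, self-contained sketch. FakeChannel, FakePath, and StaleChannelDemo are hypothetical stand-ins, not Jenkins classes. The sketch assumes that a channel object compares by identity, so a reconnect yields a channel that is not equal to the old one even though the node name and the path on disk are unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a Remoting channel; it does not override equals, so identity is used. */
final class FakeChannel {
    final String name;
    FakeChannel(String name) { this.name = name; }
}

/** Hypothetical stand-in for FilePath: equal only if both the channel and the remote path match. */
final class FakePath {
    final FakeChannel channel;
    final String remote;

    FakePath(FakeChannel channel, String remote) {
        this.channel = channel;
        this.remote = remote;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FakePath)) return false;
        FakePath that = (FakePath) o;
        return Objects.equals(channel, that.channel) && remote.equals(that.remote);
    }

    @Override
    public int hashCode() {
        return Objects.hash(channel, remote);
    }
}

final class StaleChannelDemo {
    /** Returns true if a workspace held under the old channel looks free after a reconnect. */
    static boolean looksFreeAfterReconnect(String workspace) {
        Map<FakePath, String> inUse = new HashMap<>();
        FakeChannel oldChannel = new FakeChannel("jenkins-node");
        inUse.put(new FakePath(oldChannel, workspace), "long-running job");

        // The node reconnects: same name, same disk, but a brand new channel object.
        FakeChannel newChannel = new FakeChannel("jenkins-node");
        return !inUse.containsKey(new FakePath(newChannel, workspace));
    }

    public static void main(String[] args) {
        // The path is illustrative only.
        System.out.println(looksFreeAfterReconnect("/var/jenkins/workspace/my-job")); // prints "true"
    }
}
```

The workspace is still recorded in the map, but only under a key that no new job can ever construct again, which matches what the Groovy script reported.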

      * * *

      As a partial workaround, I can restart the Jenkins instance, which rehydrates the running jobs under the new Remoting channel. This prevents workspaces from being handed out when they are already being used, but it is often too little too late: by the time the problem is noticed, many jobs have already failed.

            Activity

            jglick Jesse Glick added a comment -

            While testing a fix for JENKINS-41854, I have found that this bug remains. It seems to be a bug in Jenkins core (albeit one that could only affect Pipeline builds insofar as freestyle builds would have failed as soon as the agent disconnected anyway): WorkspaceList.inUse is keyed off of FilePath instances, which do not compare equal across the reconnection.
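One way to make the in-use check survive a reconnect, given the diagnosis above, is to key the map on something that stays stable across reconnections, such as the remote path string. The following is a minimal sketch of that idea under the assumption that each node has its own registry, so the path alone identifies a workspace. PathKeyedWorkspaces and its methods are hypothetical names, and this is not necessarily the change that shipped in Jenkins core.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-node registry that tracks in-use workspaces by remote path only. */
final class PathKeyedWorkspaces {
    private final Map<String, String> inUse = new HashMap<>();

    /** Tries to claim the workspace; returns false if another job already holds it. */
    synchronized boolean tryAcquire(String remotePath, String owner) {
        return inUse.putIfAbsent(remotePath, owner) == null;
    }

    /** Releases the workspace so that a later job can claim it. */
    synchronized void release(String remotePath) {
        inUse.remove(remotePath);
    }

    public static void main(String[] args) {
        PathKeyedWorkspaces node = new PathKeyedWorkspaces();
        String ws = "/var/jenkins/workspace/my-job"; // illustrative path

        System.out.println(node.tryAcquire(ws, "old job")); // true: workspace was free
        // A reconnect happens here; the key does not involve the channel, so nothing changes.
        System.out.println(node.tryAcquire(ws, "new job")); // false: still held by the old job
        node.release(ws);
        System.out.println(node.tryAcquire(ws, "new job")); // true: free again after release
    }
}
```

Because the key no longer contains the channel, a job that started before the disconnect keeps its claim, and a caller that is refused can fall back to a different workspace path.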


              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                basil Basil Crow
              • Votes:
                0
                Watchers:
                2

                Dates

                • Created:
                  Updated:
                  Resolved: