Jenkins / JENKINS-18800

Jenkins selects channel to wrong node for build job

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • core, ssh-slaves-plugin
    • None
    • Ubuntu 12.04 64-bit
      Java(TM) SE Runtime Environment (build 1.6.0_45-b06) 64-bit

      Jenkins LTS 1.509.1
      Jenkins SSH plugin 2.3
      Jenkins SSH Slaves plugin 0.27

      Over the last 12 months, we have encountered a very rare but critical issue in Jenkins core or the SSH Slaves plugin.

      The issue is that Jenkins spontaneously enters a state in which it reproducibly selects the wrong channel for some of its connected build hosts. All build hosts are connected via the SSH-Slaves plug-in.

      This immediately leads to failing builds because the workspace locks are no longer respected: a build locks the workspace on the correct host, but talks to a different host to execute.

      A typical log-output looks like this:

      -------------------
      14:45:35 Started by command line by <user>
      14:45:35 Building remotely on musxbird039 in workspace /local/jenkins_workspace/workspace/<PROJECT>
      [...]
      14:45:35 Checkout:<GIT-REPO> / /local/jenkins_workspace/workspace/<PROJECT> - hudson.remoting.Channel@37ecb28e:musxbird029
      -------------------

      As you can see, Jenkins selects musxbird039 for building, but uses the channel to musxbird029. Because the workspace usually exists physically on the other machine as well, the build starts. However, the workspace is locked only on musxbird039 and not on musxbird029, so builds can collide there unchecked.
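      The mismatch described above can be spotted mechanically in a console log. The following is a minimal sketch, not part of Jenkins: the function name `find_channel_mismatches` is hypothetical, and the two regular expressions are assumptions modeled only on the log lines quoted in this report, so other Jenkins or plugin versions may format these lines differently.

```python
import re

# Assumed patterns, derived from the two log lines quoted in this report.
NODE_RE = re.compile(r"Building remotely on (\S+)")
CHANNEL_RE = re.compile(r"hudson\.remoting\.Channel@[0-9a-f]+:(\S+)")


def find_channel_mismatches(log_text):
    """Return (announced_node, channel_node) pairs that disagree.

    The most recent "Building remotely on <node>" line sets the expected
    node; every later channel reference is compared against it.
    """
    announced = None
    mismatches = []
    for line in log_text.splitlines():
        node_match = NODE_RE.search(line)
        if node_match:
            announced = node_match.group(1)
            continue
        channel_match = CHANNEL_RE.search(line)
        if channel_match and announced is not None:
            channel_node = channel_match.group(1)
            if channel_node != announced:
                mismatches.append((announced, channel_node))
    return mismatches


# Sample taken from the log excerpt in this report.
SAMPLE_LOG = """\
14:45:35 Started by command line by <user>
14:45:35 Building remotely on musxbird039 in workspace /local/jenkins_workspace/workspace/<PROJECT>
14:45:35 Checkout:<GIT-REPO> / /local/jenkins_workspace/workspace/<PROJECT> - hudson.remoting.Channel@37ecb28e:musxbird029
"""

if __name__ == "__main__":
    print(find_channel_mismatches(SAMPLE_LOG))
```

      Run against saved console logs, a non-empty result would indicate a build that was announced on one node but talked to another. It only detects the symptom and says nothing about the cause.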

      This, of course, leads to a wide variety of build failures.

      We do not know of a way to reliably reproduce this issue, as it appears randomly after some time. Sometimes it takes months to appear, sometimes only days.

      The only known workaround is to disconnect both machines so that their slaves, and all associated threads on the Jenkins master, are restarted. Rebooting the server itself also works.

      We are, quite frankly, stumped by this bug. Even examining the slave->channel allocation code of Jenkins ourselves did not yield any clues.

      If you need more information, we will be happy to provide it.

      Best regards,

      Martin Schröder
      Intel Mobile Communications GmbH.

            ifernandezcalvo Ivan Fernandez Calvo
            mhschroe Martin Schröder
            Votes: 2
            Watchers: 4
