Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-43599

Distributed locks occasionally failing across nodes

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Minor
    • Resolution: Unresolved
    • Labels:
    • Environment:
    • Similar Issues:

      Description

      Are pipeline locks expected to behave when used by two nodes simultaneously?

      My Jenkinsfile defines the following helper -

      Jenkinsfile
      def log(msg) {
        echo((new Date().format("yyyy-MM-dd'T'HH:mm:ss: ")) + msg)
      }
      
      def lockDB(number, callback) {
        try {
          log "${number}: Acquiring lock ..."
          lock(resource: "boom-test-database-${number}", inversePrecedence: true) {
            log "${number}: Acquired!"
            callback()
            log "${number}: Releasing..."
          }
        } finally {
          log "${number}: Released!"
        }
      }
      

      I somewhat frequently see two nodes manage to acquire the same lock simultaneously, as in the following example, where master locked a resource from 14:45:57 -> 14:52:04, and the slave locked the same resource from 14:51:24 -> 14:56:43.

      Job1
      [featureTests] Running on master in /var/jenkins_home/workspace/BoomMultiPipeline_master-YRHUUT2RENAD5FNHS4YJ5GQKTJDN2M7B3KGAU3ELBWX7MZ2AEKBA
      [featureTests] 2017-04-07T14:40:44: 1: Acquiring lock ...
      [Pipeline] [featureTests] lock
      [featureTests] Trying to acquire lock on [boom-test-database-1]
      [featureTests] Found 0 available resource(s). Waiting for correct amount: 1.
      [featureTests] [boom-test-database-1] is locked, waiting...
      [featureTests] Lock acquired on [boom-test-database-1]
      [featureTests] 2017-04-07T14:45:57: 1: Acquired!
      ... running tests
      [featureTests] Lock released on resource [boom-test-database-1]
      [featureTests] 2017-04-07T14:52:04: 1: Released!
      
      Job2
      [featureTests] Running on slave in /var/jenkins_home/workspace/eline_tb_remaining_indices2-DTE7IMSYRVCF6GO3T5IRRR52R4YQ3GSRFNEEKCRH7FGSUBQB4SFQ
      [featureTests] 2017-04-07T14:49:33: 1: Acquiring lock ...
      [Pipeline] [featureTests] lock
      [featureTests] Trying to acquire lock on [boom-test-database-1]
      [featureTests] Found 0 available resource(s). Waiting for correct amount: 1.
      [featureTests] [boom-test-database-1] is locked, waiting...
      [featureTests] Lock acquired on [boom-test-database-1]
      [featureTests] 2017-04-07T14:51:24: 1: Acquired!
      ... running tests
      [featureTests] Lock released on resource [boom-test-database-1]
      [featureTests] 2017-04-07T14:56:43: 1: Released!
      

      Is this a bug, or am I misunderstanding how locking is supposed to work in Jenkins?

        Attachments

          Activity

          Hide
          ext3h Andreas Ringlstetter added a comment -

          Are you sure that not just the execution of the pipeline on the master has been paused for ~30 seconds while releasing the lock?

          You have only logged the timestamp after execution has resumed, not immediately prior to actually releasing the lock. (Respectively you have removed the interesting "Releasing..." line from the log output.)

          Show
          ext3h Andreas Ringlstetter added a comment - Are you sure that not just the execution of the pipeline on the master has been paused for ~30 seconds while releasing the lock? You have only logged the timestamp after execution has resumed, not immediately prior to actually releasing the lock. (Respectively you have removed the interesting "Releasing..." line from the log output.)
          Hide
          jdelstrother Jonathan del Strother added a comment - - edited

          Ugh, the "Releasing" line isn't present in the original logs, because the tests failed so raised an exception in "callback". You're suggesting that in my lockDB function, execution just paused for 30 seconds between releasing the lock and the "finally { log }" line? (Not saying that's impossible, just making sure I understand you)

          I'll add an additional try-catch to ensure I print the "Releasing..." line, will let you know...

          Show
          jdelstrother Jonathan del Strother added a comment - - edited Ugh, the "Releasing" line isn't present in the original logs, because the tests failed so raised an exception in "callback". You're suggesting that in my lockDB function, execution just paused for 30 seconds between releasing the lock and the "finally { log }" line? (Not saying that's impossible, just making sure I understand you) I'll add an additional try-catch to ensure I print the "Releasing..." line, will let you know...
          Hide
          trambrant Thomas Rambrant added a comment - - edited

          I have the same problem so I'm very interested in a solution... 
           
          I have tried everything I can think of. I'm currently locking on the outmost level, so that every stage and node blocks is inside the lock. I previously had it inside a node block  but got the impression that the lock only worked for jobs on the same node. But setting it on the outmost level still fails sometimes, even for jobs executed on the same node..

          I have several jobs that is locked on the same resource. 

          1. Long running integration test that locks a database resource and runs for 3 hours
          2. 55 minutes after the start of the first job another job pipe-line job start that locks on the same database. The minute the second job starts to fiddle (drop table) with the database the first job crashes

          When I look in the log both jobs say that they have locked on the same resource

          Show
          trambrant Thomas Rambrant added a comment - - edited I have the same problem so I'm very interested in a solution...    I have tried everything I can think of. I'm currently locking on the outmost level, so that every stage and node blocks is inside the lock. I previously had it inside a node block  but got the impression that the lock only worked for jobs on the same node. But setting it on the outmost level still fails sometimes, even for jobs executed on the same node.. I have several jobs that is locked on the same resource.  Long running integration test that locks a database resource and runs for 3 hours 55 minutes after the start of the first job another job pipe-line job start that locks on the same database. The minute the second job starts to fiddle (drop table) with the database the first job crashes When I look in the log both jobs say that they have locked on the same resource

            People

            • Assignee:
              Unassigned
              Reporter:
              jdelstrother Jonathan del Strother
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: