Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-37293

ECS Agents are not deleted because Jenkins is unable to delete logs due to NFS locks

    Details

    • Similar Issues:

      Description

      We have our Jenkins master running on a NFS 4.1 mount (AWS EFS).

      When a job completes we see the following in the logs:

      Aug 09, 2016 4:21:09 PM INFO com.cloudbees.jenkins.plugins.amazonecs.ECSService deleteTask
      Delete ECS Slave task: arn:aws:ecs:us-east-1:XXXXXXXXXXXX:task/80b3bc38-774c-4bbe-9e79-f9db126a2b9b
      Aug 09, 2016 4:21:09 PM WARNING hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitor
      Failed to monitor jenkins-ecs-d6d51f89f024 for Free Disk Space
      java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown
      	at hudson.remoting.Request$1.get(Request.java:282)
      	at hudson.remoting.Request$1.get(Request.java:207)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:96)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown
      	at hudson.remoting.Request.abort(Request.java:303)
      	at hudson.remoting.Channel.terminate(Channel.java:847)
      	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1080)
      	at hudson.remoting.Channel$1.handle(Channel.java:501)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
      Caused by: hudson.remoting.Channel$OrderlyShutdown
      	... 3 more
      Caused by: Command close created at
      	at hudson.remoting.Command.<init>(Command.java:62)
      	at hudson.remoting.Command.<init>(Command.java:47)
      	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:822)
      	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:822)
      	at hudson.remoting.Channel.close(Channel.java:867)
      	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:825)
      	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1106)
      
      Aug 09, 2016 4:21:09 PM WARNING hudson.slaves.SlaveComputer kill
      Unable to delete agent logs
      java.io.IOException: Unable to delete '/var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
      	at hudson.Util.deleteRecursive(Util.java:355)
      	at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:670)
      	at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:86)
      	at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:205)
      	at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
      	at jenkins.model.Nodes$4.run(Nodes.java:219)
      	at hudson.model.Queue._withLock(Queue.java:1320)
      	at hudson.model.Queue.withLock(Queue.java:1197)
      	at jenkins.model.Nodes.removeNode(Nodes.java:210)
      	at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
      	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:70)
      	at com.cloudbees.jenkins.plugins.amazonecs.ECSComputer.terminate(ECSComputer.java:62)
      	at com.cloudbees.jenkins.plugins.amazonecs.ECSComputer.taskCompleted(ECSComputer.java:47)
      	at hudson.model.queue.WorkUnitContext.synchronizeEnd(WorkUnitContext.java:140)
      	at hudson.model.Executor.finish1(Executor.java:457)
      	at hudson.model.Executor.run(Executor.java:430)
      Caused by: java.nio.file.FileSystemException: /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024/.nfs9e9ba4fd7c349fa200000003: Device or resource busy
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
      	at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
      	at java.nio.file.Files.deleteIfExists(Files.java:1165)
      	at hudson.Util.tryOnceDeleteFile(Util.java:296)
      	at hudson.Util.tryOnceDeleteRecursive(Util.java:373)
      	at hudson.Util.tryOnceDeleteContentsRecursive(Util.java:392)
      	at hudson.Util.tryOnceDeleteRecursive(Util.java:372)
      	at hudson.Util.deleteRecursive(Util.java:350)
      	... 15 more
      

      Looking at the Jenkins master I do indeed see the log file still open:

      eborgstrom@devops-jenkins-7d98bfe3:/var/lib/jenkins$ sudo lsof +D /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024
      COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF                 NODE NAME
      java    26119 jenkins  486w   REG   0,25       22 11428909888000270242 /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024/.nfs9e9ba4fd7c349fa200000003
      
      eborgstrom@devops-jenkins-7d98bfe3:/var/lib/jenkins$ find /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024 -ls
      414054371676357616    4 drwxr-xr-x   2 jenkins  jenkins      4096 Aug  9 23:21 /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024
      11428909888000270242    4 -rw-r--r--   1 jenkins  jenkins        22 Aug  9 23:21 /var/lib/jenkins/logs/slaves/jenkins-ecs-d6d51f89f024/.nfs9e9ba4fd7c349fa200000003
      

      This means agents remain around and need to be cleaned up manually.

      The logs should be fully closed before the slave is deleted.

        Attachments

          Activity

          Hide
          borgstrom Evan Borgstrom added a comment -

          FWIW, I remounted our EFS files system with the noatime,nodiratime options today and it seems to be a solid workaround for the issue, slaves are being cleaned up as expected.

          While I intend to leave those mount options on, this should still be fixed.

          Show
          borgstrom Evan Borgstrom added a comment - FWIW, I remounted our EFS files system with the noatime,nodiratime options today and it seems to be a solid workaround for the issue, slaves are being cleaned up as expected. While I intend to leave those mount options on, this should still be fixed.
          Hide
          borgstrom Evan Borgstrom added a comment -

          This just happened again, so the noatime,nodiratime work around doesn't actually work...

          Aug 12, 2016 9:03:29 AM WARNING hudson.slaves.SlaveComputer kill
          Unable to delete agent logs
          java.io.IOException: Unable to delete '/var/lib/jenkins/logs/slaves/ecs-1aaa9aed01d09'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
             ...
          Caused by: java.nio.file.FileSystemException: /var/lib/jenkins/logs/slaves/ecs-1aaa9aed01d09/.nfs8f2ea0ecf5e622fd00000022: Device or resource busy
          
          Show
          borgstrom Evan Borgstrom added a comment - This just happened again, so the noatime,nodiratime work around doesn't actually work... Aug 12, 2016 9:03:29 AM WARNING hudson.slaves.SlaveComputer kill Unable to delete agent logs java.io.IOException: Unable to delete '/var/lib/jenkins/logs/slaves/ecs-1aaa9aed01d09'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts. ... Caused by: java.nio.file.FileSystemException: /var/lib/jenkins/logs/slaves/ecs-1aaa9aed01d09/.nfs8f2ea0ecf5e622fd00000022: Device or resource busy
          Hide
          borgstrom Evan Borgstrom added a comment -

          Just upgraded from 1.4 to 1.5. The issue persists.

          Show
          borgstrom Evan Borgstrom added a comment - Just upgraded from 1.4 to 1.5. The issue persists.
          Hide
          pgarbe Philipp Garbe added a comment -

          Is this still an issue?

          Show
          pgarbe Philipp Garbe added a comment - Is this still an issue?
          Hide
          pgarbe Philipp Garbe added a comment -

          No feedback since months

          Show
          pgarbe Philipp Garbe added a comment - No feedback since months

            People

            • Assignee:
              roehrijn2 Jan Roehrich
              Reporter:
              borgstrom Evan Borgstrom
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: