  1. Jenkins
  2. JENKINS-15331

Workaround Windows unpredictable file locking in Util.deleteContentsRecursive

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: core
    • Labels:
      None
    • Environment:
      Microsoft Windows
      Description

      Please enhance the hudson.Util.deleteContentsRecursive method to:

      1. delete everything it can
      2. try several times to delete everything
      3. only throw an exception if it can't delete everything (listing everything that it can't delete)

      Reasoning...
      Unlike Unix, Microsoft Windows does not allow a file to be deleted while something has that file open, so delete operations fail.
      Furthermore, most Windows installations run software that monitors the filesystem for activity and inspects recently added or removed files, locking them temporarily while it does so. The Windows Search service and anti-virus software are two examples, and Windows Vista and Windows 7 seem to have additional complications.

      This means that builds which rely on cleaning a workspace before they start will sometimes fail (claiming that they couldn't delete everything because a file was locked), resulting in a build failing with the following output:

      Started by an SCM change
      Building remotely on jenkinsslave27 in workspace C:\hudsonSlave\workspace\MyProject
      Purging workspace...
      hudson.util.IOException2: remote file operation failed: C:\hudsonSlave\workspace\MyProject at hudson.remoting.Channel@6f0564d7:jenkinsslave27
      	at hudson.FilePath.act(FilePath.java:835)
      	at hudson.FilePath.act(FilePath.java:821)
      	at hudson.plugins.accurev.AccurevSCM.checkout(AccurevSCM.java:331)
      	at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
      	at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:586)
      	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:475)
      	at hudson.model.Run.run(Run.java:1434)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:239)
      Caused by: java.io.IOException: Unable to delete C:\hudsonSlave\workspace\MyProject\...\src\...\foo - files in dir: [C:\hudsonSlave\workspace\MyProject\...\src\...\foo\bar]
      	at hudson.Util.deleteFile(Util.java:236)
      	at hudson.Util.deleteRecursive(Util.java:287)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.Util.deleteRecursive(Util.java:278)
      	at hudson.Util.deleteContentsRecursive(Util.java:198)
      	at hudson.plugins.accurev.PurgeWorkspaceContents.invoke(PurgeWorkspaceContents.java:28)
      	at hudson.plugins.accurev.PurgeWorkspaceContents.invoke(PurgeWorkspaceContents.java:11)
      	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2161)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:118)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      	at hudson.remoting.Request$2.run(Request.java:287)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
      	at java.util.concurrent.FutureTask.run(Unknown Source)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      	at hudson.remoting.Engine$1$1.run(Engine.java:60)
      	at java.lang.Thread.run(Unknown Source)
      

      What's needed is a retry mechanism, i.e. the equivalent of using Ant's <retry><delete file="foo"/></retry>, but with a (small) delay between attempts (and maybe a call to the garbage collector, just in case the process holding the file open is the build slave process itself).
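As a rough illustration of the retry idea described above, here is a minimal Java sketch. It is not the Jenkins implementation: the class name `RetryingDelete`, the method `deleteWithRetry` and its parameters are all hypothetical, and it only handles a single file or empty directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not part of hudson.Util. */
final class RetryingDelete {
    private RetryingDelete() {}

    /**
     * Tries to delete one file or empty directory up to maxAttempts times,
     * sleeping waitMillis between attempts. Optionally asks the JVM to run
     * the garbage collector between attempts, in case this process itself
     * still holds an unclosed handle on the file.
     */
    static void deleteWithRetry(Path path, int maxAttempts, long waitMillis, boolean gcBetweenAttempts)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Files.deleteIfExists(path); // a missing path counts as success
                return;
            } catch (IOException e) {
                lastFailure = e; // remember why the latest attempt failed
            }
            if (attempt < maxAttempts) {
                if (gcBetweenAttempts) {
                    System.gc(); // only a hint to the JVM
                }
                Thread.sleep(waitMillis);
            }
        }
        throw new IOException("Unable to delete " + path + " after " + maxAttempts + " attempts", lastFailure);
    }
}
```

A caller might use something like `RetryingDelete.deleteWithRetry(path, 3, 500L, true)`; the values are only examples.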

            Activity

            pjdarton pjdarton created issue -
            pjdarton pjdarton added a comment -

            Note: This file locking behavior also causes non-Jenkins issues, e.g. deleting multiple folders using Windows Explorer will sometimes leave one (usually empty) folder behind, and even a simple "RD /S /Q MyFolder" will sometimes fail to delete the folder on its first attempt. In these cases, simply retrying the operation will succeed. Personally, I think it's a Windows "feature".

            As a workaround, I've wrapped most of my calls to Ant's <delete> task in <retry>, and this has eliminated this problem from any of my builds that manage to start BUT this doesn't help if Jenkins doesn't get as far as running my builds.
            e.g. I'm using the accurev plugin for my SCM and it cleans the working directory before it grabs the source - I typically get about a 1% failure rate at this stage. Whilst 1% is not a blocking issue, it's not reliable, which is not what one wants from a build system.

            Personally, I've found that excluding the build areas from Search & anti-virus helps reduce the problem, but it is insufficient to stop these failures completely (at least on Windows 7) - something, somewhere, will still lock files, sometimes, but any investigation (after the build has failed) shows that no process has the file "open".

            pjdarton pjdarton added a comment - - edited

            Features:

            • Added two new system properties that control behavior: "Util.deletionRetries" (an integer, defaults to 3) and "Util.deletionRetryWait" (an integer, defaults to 500ms).
            • Delete operations that affect directories now try to delete the entire contents of the directory, continuing on to subfolders etc even after encountering files that wouldn't die, before eventually throwing an exception about what wouldn't die. i.e. if a folder has a file "a", "b" and "c", and you can't delete "b", then "a" and "c" would get deleted (and you'll still get the exception about "b").
            • Delete operations now have multiple attempts at deleting things, so if not everything could be deleted first time around, maybe they'll get deleted 2nd/3rd etc time around. An exception is only thrown if all retry attempts are exhausted and there are still files/directories that won't delete.
            • Added some unit tests for these methods.
            • After posting this back in October 2012, I built a version of Jenkins LTS with this patch applied. I've been using it at work for all our development stuff and I've not had file locking problems since. I'm pretty confident that it fixes the problem.

            Disclaimers:

            • I've not tested this (or run the unit tests) on Linux. It should be harmless (behavioral changes are conditional on being on Windows), but it'd be worth running the unit tests on Linux just to verify that.
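The behavior listed under Features can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the patch: the class `BestEffortDelete` and its methods are hypothetical, only the two property names come from the comment above, reading them with `Integer.getInteger` is an assumption, and so is treating "Util.deletionRetries" as the number of extra passes after the first one.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the code from the attached patch. */
final class BestEffortDelete {
    private BestEffortDelete() {}

    // Property names are taken from the comment; how the patch reads them is assumed.
    static final int RETRIES = Integer.getInteger("Util.deletionRetries", 3);
    static final int RETRY_WAIT_MS = Integer.getInteger("Util.deletionRetryWait", 500);

    /**
     * Makes one pass over dir, deleting everything it can and carrying on
     * after failures. Returns the paths that could not be deleted. A directory
     * that still contains an undeletable child is reported as well.
     */
    static List<Path> deleteContentsOnce(Path dir) {
        List<Path> failures = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                // Do not follow symbolic links: a link is removed, not descended into.
                if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                    failures.addAll(deleteContentsOnce(child));
                }
                try {
                    Files.deleteIfExists(child);
                } catch (IOException e) {
                    failures.add(child); // keep going with the siblings
                }
            }
        } catch (IOException e) {
            failures.add(dir); // the directory could not be listed
        }
        return failures;
    }

    /** Repeats the pass until nothing is left or the retries are used up. */
    static void deleteContents(Path dir, int retries, long waitMillis)
            throws IOException, InterruptedException {
        List<Path> failures = deleteContentsOnce(dir);
        for (int retry = 0; retry < retries && !failures.isEmpty(); retry++) {
            Thread.sleep(waitMillis);
            failures = deleteContentsOnce(dir);
        }
        if (!failures.isEmpty()) {
            throw new IOException("Unable to delete: " + failures);
        }
    }
}
```

With a folder holding "a", "b" and "c" where "b" is locked, a single pass removes "a" and "c" and returns a list containing only "b", which matches the behavior described in the second bullet.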
            pjdarton pjdarton added a comment -

            JENKINS-15331 should fix JENKINS-10905.

            pjdarton pjdarton made changes -
            Field Original Value New Value
            Link This issue is related to JENKINS-10905 [ JENKINS-10905 ]
            pjdarton pjdarton made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            pjdarton pjdarton added a comment -

            Uploaded git patch file; this was produced using the git command-line and isn't claiming to change the entire file. This will probably be a lot easier to merge.

            This is my "New-and-improved" solution.
            In addition to retrying the deletes, this also calls System.gc() if it's on Windows (a tactic that's also used in Apache Ant's Delete task to work around the same problem).

            pjdarton pjdarton made changes -
            Attachment 0001-JENKINS-15331.patch [ 22814 ]
            pjdarton pjdarton added a comment -

            Have re-done my GitHub pull request to reflect the new changes (and to fix the CRLF issue with the previous pull request).
            New pull request is https://github.com/jenkinsci/jenkins/pull/615

            pjdarton pjdarton made changes -
            Link This issue is related to JENKINS-3053 [ JENKINS-3053 ]
            pjdarton pjdarton added a comment -

            I've now been running the LTS Jenkins build (1.480.1) with this patch applied at work for a while.
            I've not seen any builds failing due to "file in use" since.
            I would therefore recommend that this patch / pull-request be incorporated into the main branch ASAP, and to the next LTS release.

            dankirkd Daniel Kirkdorffer made changes -
            Link This issue is related to JENKINS-15852 [ JENKINS-15852 ]
            dankirkd Daniel Kirkdorffer made changes -
            Priority Minor [ 4 ] Major [ 3 ]
            dankirkd Daniel Kirkdorffer added a comment -

            I believe this is also the root cause of JENKINS-15852. The Git Plugin has a call in GitAPI to FilePath.deleteRecursive(), which in turn calls Util.deleteRecursive(). It is almost immediately trying to delete a workspace that has just been created. Additionally, we have encryption and McAfee software monitoring files that could be locking them.

            pjdarton pjdarton added a comment - - edited

            File-locking is the bane of anyone running any kind of automated system on Windows, so I'd agree that this might well solve the problem (as long as you're sure that the Git code doesn't use the workspace as its current directory, as no amount of retrying will change that).

            I also have anti-virus stuff running on my build slaves, and despite that I've not noticed any builds fail due to file-locking issues since I started running a custom build of Jenkins LTS that has this fix in it.
            I think that this amounts to a fair amount of circumstantial evidence that this fix works.

            per_westling Per Westling added a comment -

            This is a very interesting patch, as we encounter a similar bug several times a week.

            Will this be added to the Jenkins releases in the near future?

            brian3791 Brian Brooks added a comment - - edited

            We are encountering a similar problem that I originally attributed to some kind of weird conflict between

            Use private Maven repository

            and

            SCM / Subversion / Check-out Strategy / Always checkout a fresh copy

            Not sure why a Maven repo entry local to the workspace would be locked before the code is even checked out. Maven shouldn't even be running yet and no process other than the Jenkins job which uses this workspace should be referencing a workspace private maven repo entry.

            Environment:

            • Jenkins 1.517
            • Maven 3.4 (-Xmx1536m -XX:MaxPermSize=256m)
            • Java 1.7.0_15-b03 Oracle JVM 64-bit
            • Windows 2008 Server 64-bit
            • Clean server with no virus scanner, indexing, etc.
            • Dell PowerEdge 2950
            • PERC 5i Serial Attached SCSI controller
              This machine has 2 CPUs with 4 cores each (a total of 8 cores).
              This server is configured with a single C: partition formed from two physical drives in RAID 1.

            Build Console Output

            Started by timer
            Building in workspace C:\Jenkins\jobs\Maxview-Daily-Build-6.2-WINDOWS-Trunk\workspace
            Cleaning local Directory .
            java.nio.file.FileSystemException: C:\Jenkins\jobs\Maxview-Daily-Build-6.2-WINDOWS-Trunk\workspace\.\.repository\ant\ant-antlr\1.6.5\ant-antlr-1.6.5.jar: The process cannot access the file because it is being used by another process.
            
            	at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
            	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
            	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
            	at sun.nio.fs.WindowsFileSystemProvider.implDelete(Unknown Source)
            	at sun.nio.fs.AbstractFileSystemProvider.delete(Unknown Source)
            	at java.nio.file.Files.delete(Unknown Source)
            	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
            	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            	at java.lang.reflect.Method.invoke(Unknown Source)
            	at hudson.Util.deleteFile(Util.java:237)
            	at hudson.Util.deleteRecursive(Util.java:305)
            	at hudson.Util.deleteContentsRecursive(Util.java:202)
            	at hudson.Util.deleteRecursive(Util.java:296)
            	at hudson.Util.deleteContentsRecursive(Util.java:202)
            	at hudson.Util.deleteRecursive(Util.java:296)
            	at hudson.Util.deleteContentsRecursive(Util.java:202)
            	at hudson.Util.deleteRecursive(Util.java:296)
            	at hudson.Util.deleteContentsRecursive(Util.java:202)
            	at hudson.Util.deleteRecursive(Util.java:296)
            	at hudson.Util.deleteContentsRecursive(Util.java:202)
            	at hudson.scm.subversion.CheckoutUpdater$1.perform(CheckoutUpdater.java:75)
            	at hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:153)
            	at hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:903)
            	at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:884)
            	at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:867)
            	at hudson.FilePath.act(FilePath.java:905)
            	at hudson.FilePath.act(FilePath.java:878)
            	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:843)
            	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781)
            	at hudson.model.AbstractProject.checkout(AbstractProject.java:1369)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:676)
            	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
            	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:581)
            	at hudson.model.Run.execute(Run.java:1576)
            	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:486)
            	at hudson.model.ResourceController.execute(ResourceController.java:88)
            	at hudson.model.Executor.run(Executor.java:241)
            
            pjdarton pjdarton added a comment -

            Yes, that's the kind of error you can get when doing any filesystem access on Windows (whether from Java or anything else) - basically, if you're on Windows, ANY file operation can fail (at any point) with a "file locked by another process" error and you need to catch these and retry (as, if you retry after a small delay, whatever process was sabotaging your operation will have moved on).
            It's also the kind of error that I kept getting that prompted me to create this patch, and I can state (with some confidence now) that this fixed it for me.

            Note: under Java, the process sabotaging your file operation might well be your own - if you don't manually close file handles but just rely on the garbage collector to do so, attempts to delete those files will fail until the GC has run. This is why I run the GC as well, just in case (not sure if that was a deciding factor, but it's what Ant does and it worked for me).
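To illustrate the point about unclosed handles: if the code that opened the file closes it explicitly, a later delete does not depend on the garbage collector having run. A minimal sketch, with a hypothetical class and method name, assuming the file is written and deleted by the same process:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical example for illustration only. */
final class ExplicitClose {
    private ExplicitClose() {}

    /** Writes text to file, closes the handle, then deletes the file. */
    static void writeThenDelete(Path file, String text) throws IOException {
        // try-with-resources closes the stream when the block exits,
        // even if write() throws, so no handle is left for the GC to clean up.
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        // This process no longer holds the file open at this point.
        Files.delete(file);
    }
}
```

This only removes locks held by your own code; it does nothing about other processes, which is where the retry still matters.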

            brian3791 Brian Brooks made changes -
            Link This issue is related to JENKINS-17995 [ JENKINS-17995 ]
            dhs Dirk Heinrichs added a comment -

            To make things even worse, the message "file locked by another process" doesn't necessarily mean that the file IS locked by another process. Notepad++, for example, prints this message even if the real error is "permission denied". Took me quite some time to find out...

            pjdarton pjdarton added a comment -

            I've just attached a new patch file "0001-Proposed-solution-to-JENKINS-15331.patch"
            This one is based on the current Jenkins master trunk (at the time of writing, that's aimed at 1.560-SNAPSHOT).

            This is slightly different than the earlier patch:
            1) The configuration for garbage-collection when deletes fail now defaults to "false" on all platforms.
            2) The garbage-collection should now get called if it's enabled (the previous version had a bug).

            pjdarton pjdarton made changes -
            danielbeck Daniel Beck made changes -
            Link This issue is duplicated by JENKINS-10905 [ JENKINS-10905 ]
            pjdarton pjdarton added a comment -

            There's an open pull request on GitHub that fixes this: https://github.com/jenkinsci/jenkins/pull/1209
            That supersedes the patches etc. here.

            danielbeck Daniel Beck made changes -
            Link This issue is duplicated by JENKINS-3053 [ JENKINS-3053 ]
            danielbeck Daniel Beck made changes -
            Link This issue is duplicated by JENKINS-14808 [ JENKINS-14808 ]
            laurent_malvert Laurent Malvert added a comment -

            Also see this issue (and have for a while) up until at least the current 1.5776 release.

            As mentioned by others above, it's a common gripe with NTFS, and sadly your chances of hitting that issue increase considerably with large checkouts/workspaces (it struggles to delete a large number of files efficiently).

            It would be great if this patch could finally be merged into the head.

            danielbeck Daniel Beck added a comment -

            The pull request can no longer be merged cleanly, as I commented six weeks ago. Also, a question by Oliver Gondza is unanswered.

            laurent_malvert Laurent Malvert added a comment - - edited

            Also, I'd like to recommend an alternative to the "delay and wait before retrying" strategy... While this one works most of the time, it's not entirely fool-proof as you can only hope that NTFS will release that lock within the timeframe of your delays/retries.

            Generally what serves me best on NTFS systems is to NOT delete large folders (at first), but instead to rename/move them to a different location (where they can be deleted by a batch job). And possibly to recreate the desired folder.

            I actually do this for my maven local repository and most of my development checkouts on my development machine. I have a custom alias that moves things to a temp folder instead of deleting them, and a cron job that regularly deletes that folder. This way you have no lock on the folder you're currently working on.

            Jenkins could very well use a similar approach by moving the data to be disposed of to the Windows temp folder, or to a trash folder of its own choosing to be regularly emptied by an internal task.

            This approach has multiple advantages:

            • solves the locking for sure,
            • no garbage collection required,
            • no artificial delay required,
            • and actually the "delete" operation is now perceived to be considerably faster (as it doesn't really happen, and move operations are close to instantaneous on most file systems).

            Of course it means that at a given time, a lengthy and possibly intensive deletion process will occur in the background, but depending on how you implement it this could be scheduled to be done during periods of inactivity, or according to a planned schedule, or only when running out of disk space, etc...

            Just my 2 cents, but considering that it's not atypical for Jenkins to deal with large folders, it would seem like a good approach for a number of scenarios (new/clean workspaces, deleting build records, deleting jobs, etc.).
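
            The move-then-delete idea described above could look roughly like the following in Java. This is only a sketch under stated assumptions: the class TrashingDeleter and its methods are hypothetical and not part of Jenkins, and the trash location is whatever the caller chooses. Note that the rename itself may still fail on Windows if another process holds a handle inside the directory, and that Files.move only behaves as a cheap rename when source and target are on the same volume.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.UUID;

/** Hypothetical helper illustrating "move now, delete later"; not Jenkins code. */
public class TrashingDeleter {

    /**
     * Moves dir into trashDir under a unique name and recreates an empty dir
     * in its place. Returns the location of the moved directory.
     */
    public static Path moveToTrash(Path dir, Path trashDir) throws IOException {
        Files.createDirectories(trashDir);
        Path target = trashDir.resolve(dir.getFileName() + "-" + UUID.randomUUID());
        // A plain rename when both paths are on the same volume; moving a
        // non-empty directory across volumes is not supported by Files.move.
        Files.move(dir, target);
        Files.createDirectories(dir);
        return target;
    }

    /**
     * Best-effort purge of everything below trashDir, meant to be run later
     * (for example by a scheduled task). Returns the number of entries that
     * could not be deleted on this pass.
     */
    public static int emptyTrash(final Path trashDir) throws IOException {
        if (!Files.isDirectory(trashDir)) {
            return 0;
        }
        final int[] failures = {0};
        Files.walkFileTree(trashDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                tryDelete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc) {
                if (!d.equals(trashDir)) {
                    tryDelete(d); // fails (and is counted) if a child survived
                }
                return FileVisitResult.CONTINUE;
            }

            private void tryDelete(Path p) {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    failures[0]++; // leave it for the next purge
                }
            }
        });
        return failures[0];
    }
}
```

            A caller would invoke moveToTrash(workspace, trash) when a clean workspace is needed and run emptyTrash(trash) from a background job; anything still locked simply stays in the trash until a later pass.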

            arpitgold Arpit Nagar added a comment - - edited

            Any update on this?

            apgray Andrew Gray added a comment -

            This is affecting the Allure Reports plugin as well. Is this any closer to a fix?

            arpitgold Arpit Nagar added a comment -

            This is a blocker for us. Have you found any solution for this?

            tsondergaard tsondergaard added a comment -

            The problem appears to be eliminated or at least significantly reduced by disabling the "Windows Search" indexing service. Also look out for anti-virus programs causing problems.

            http://www.pcmag.com/slideshow_viewer/0,3253,l=251692&a=251692&po=4,00.asp

            danielbeck Daniel Beck made changes -
            Link This issue is duplicated by JENKINS-24481 [ JENKINS-24481 ]
            pjdarton pjdarton added a comment -

            "Been there, done that"
            In my experience, disabling Windows Search and anti-virus merely reduces the problem, e.g. down from a 5% failure rate to a 0.5% failure rate.
            On all my Windows build slaves, I've configured Windows Search to only search the start menu and then disabled the search service entirely, and I've configured the anti-virus to exclude the Jenkins build area from its scans and on-access checking. Even so, I was still seeing builds fail every week due to transient file-locking problems.

            After implementing the fix for this (https://github.com/jenkinsci/jenkins/pull/1209) and applying it to my local Jenkins server, I haven't seen a single build fail due to these transient file-locking problems.

            barnard_robert Robert Barnard added a comment -

            We're experiencing the same issue. After we shut down a job, a process keeps running and holds a file open. In our case it's the aopalliance jar.

            pjdarton pjdarton added a comment -

            Update:
            I split the code changes into a refactor of the unit-test code (to make it easier to test this), and the actual enhancement to the deletion code.
            The refactor has been incorporated into Jenkins' core code already. The actual enhancement code changes are in https://github.com/jenkinsci/jenkins/pull/1800 and awaiting merge.

            jogipraveen123 praveen kumar jogi added a comment - - edited

            I would like to fail the job if it is unable to complete.

            We have a job that deploys a WAR file and starts the service on a Windows build machine. However, if any process has a file open in the destination directory, the files fail to deploy and yet the build is still reported as successful. I would like the build to fail instead when the log contains messages like the ones below. Is there any workaround?

            Log:
            02:26:08 c:\resin-3.1.12\webapps\ROOT\WEB-INF\lib\ridl-3.2.1.jar - The process cannot access the file because it is being used by another process.
            02:26:08 c:\resin-3.1.12\webapps\ROOT\WEB-INF\lib\unoil-3.2.1.jar - The process cannot access the file because it is being used by another process.

            Jenkins architecture:
            Master 1.656 (linux)
            couple of windows build slaves

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Peter Darton
            Path:
            core/src/main/java/hudson/Util.java
            core/src/test/java/hudson/UtilTest.java
            http://jenkins-ci.org/commit/jenkins/310c6747625a5e5605ac87c68d02eddaacdc8e0e
            Log:
            FIXED JENKINS-15331 by changing Util.deleteContentsRecursive, Util.deleteFile and Util.deleteRecursive so that they can retry failed deletions.
            The number of deletion attempts and the time it waits between deletes are configurable via system properties (like hudson.Util.noSymlink etc).
            Util.DELETION_MAX is set by -Dhudson.Util.deletionMax. Default is 3 attempts.
            Util.WAIT_BETWEEN_DELETION_RETRIES is set by -Dhudson.Util.deletionRetryWait. Default is 100 milliseconds.
            Util.GC_AFTER_FAILED_DELETE is set by -Dhudson.Util.performGCOnFailedDelete. Default is false.

            Added unit-tests for new functionality.
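
            To make the retry behaviour described in this commit log concrete, here is a minimal Java sketch of a delete that retries with a pause between attempts. It is not the code from Util.java: the class RetryingDelete and its method are hypothetical, and only the three system property names and their defaults (3 attempts, 100 milliseconds, GC off) are taken from the log above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of retry-on-failed-delete; not the Jenkins implementation. */
public class RetryingDelete {

    // Property names and defaults as listed in the commit log above.
    static final int MAX_ATTEMPTS =
            Math.max(1, Integer.getInteger("hudson.Util.deletionMax", 3));
    static final int WAIT_MS =
            Math.max(0, Integer.getInteger("hudson.Util.deletionRetryWait", 100));
    static final boolean GC_AFTER_FAILURE =
            Boolean.getBoolean("hudson.Util.performGCOnFailedDelete");

    /**
     * Tries to delete a single file or empty directory, retrying after a
     * short wait when the attempt fails (for example because another process
     * briefly holds a lock). Throws once all attempts are used up.
     */
    public static void deleteWithRetries(Path path)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                Files.deleteIfExists(path);
                return; // deleted, or it was already gone
            } catch (IOException e) {
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    if (GC_AFTER_FAILURE) {
                        // May release file handles held by unreferenced streams.
                        System.gc();
                    }
                    Thread.sleep(WAIT_MS);
                }
            }
        }
        throw new IOException("Unable to delete " + path + " after "
                + MAX_ATTEMPTS + " attempts", last);
    }
}
```

            With a sketch like this, starting the JVM with, say, -Dhudson.Util.deletionMax=5 would raise the number of attempts from 3 to 5; the real properties are read by Jenkins itself in the same -D style.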

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Daniel Beck
            Path:
            core/src/main/java/hudson/Util.java
            core/src/test/java/hudson/UtilTest.java
            http://jenkins-ci.org/commit/jenkins/240405dfc33e9c9a96a159b36be269b3201567fa
            Log:
            Merge pull request #2026 from pjdarton/fix_jenkins_15331

            [FIX JENKINS-15331] Windows file locking workaround

            Compare: https://github.com/jenkinsci/jenkins/compare/49a65c2bbbd8...240405dfc33e

            scm_issue_link SCM/JIRA link daemon made changes -
            Status In Progress [ 3 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            jgbiii James Brown made changes -
            Labels lts-candidate
            pjdarton pjdarton added a comment -

            Code changes are in Jenkins 2.2 onwards.
            Parameters that control this functionality have been documented on https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties

            jorgziegler Jörg Ziegler added a comment -

            Are there any plans to backport this to LTS/1.651? The issue persists on Windows Server 2012R2 running with 1.651.2.

            danielbeck Daniel Beck added a comment -

            This was not considered for backporting as this issue is an Improvement and not a Bug.

            Now it's too late, the 1.651.3 RC is out.

            pjdarton pjdarton added a comment -

            The only reason this was logged as an "improvement" is because the fault really lies within the Windows OS / JRE and not within Jenkins itself, but all the symptoms (the issues that link to this) are bugs from an end-user's point of view - Jenkins builds "fail at random" on Windows (which is a bug), and this "improvement" is the cure.
            i.e. For anyone trying to do builds on Windows, this is a bugfix (as evidenced by all the issues that link to this).

            So, sure, this is an "improvement" - Jenkins now works reliably on Windows, and that's a huge improvement - but the reason I coded this was to fix a whole load of unreliability (aka "bugs") that are seen on Windows.

            This was flagged as an lts-candidate, so I was rather hoping that it'd be backported to the LTS release.
            As it stands now, either all Windows users have to upgrade to Jenkins 2, or they have to build their own LTS version (as I had to) ... or it gets included in the next LTS. You can probably guess which option I'm in favour of.

            jorgziegler Jörg Ziegler added a comment -

            Thanks pjdarton - this bug is pretty much killing our productivity as it requires manually restarting slaves every few hours. I strongly agree that it's more than an improvement.

            danielbeck Daniel Beck added a comment -

            pjdarton Not my fault – Oliver Gondža filters for issue type and resolution, and anything that's not a fixed bug doesn't qualify, label or not.

            This could have been corrected before the RC was published; by now it's too late for .3.

            jorgziegler Jörg Ziegler added a comment -

            Daniel Beck thanks for the quick replies. Is there any field that would need updating in this issue so that it will be included in a .4?

            oleg_nenashev Oleg Nenashev added a comment -

            Actually, we can still merge it to .3 if Oliver Gondža agrees, but I'm not keen on that since the RC is under testing now.
            Regarding .4, it is unlikely to happen under the current release model. It would need a wide discussion on the developer list.

            BR, Oleg

            danielbeck Daniel Beck added a comment -

            We don't do .4's, except when we mess up so badly there's no way around it, but this doesn't qualify.

            olivergondza Oliver Gondža added a comment -

            I decided not to squeeze this into .3 (last in its line) for stability's sake. We need to be extra careful as we do not do much testing on Windows, unfortunately.

            olivergondza Oliver Gondža added a comment -

            Already consumed by the 2.7.X line, so no need to backport.

            olivergondza Oliver Gondža made changes -
            Labels lts-candidate
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 146054 ] JNJira + In-Review [ 191764 ]

              People

              • Assignee:
                Unassigned
                Reporter:
                pjdarton pjdarton
              • Votes:
                28
                Watchers:
                37
