# Workaround Windows unpredictable file locking in Util.deleteContentsRecursive

#### Details

• Type: Improvement
• Status: Resolved
• Priority: Major
• Resolution: Fixed
• Component/s: None
• Labels: None
• Environment: Microsoft Windows

#### Description

Please enhance the hudson.Util.deleteContentsRecursive method to:

1. delete everything it can
2. try several times to delete everything
3. only throw an exception if it can't delete everything (listing everything that it can't delete)
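
Points 1 and 3 could be sketched as follows (illustrative class and method names, not the actual hudson.Util code):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of "delete everything you can, and only throw at the end,
 * listing everything that would not die". Illustrative only; this is
 * not the real hudson.Util implementation.
 */
public class BestEffortDelete {

    public static void deleteContentsRecursive(File dir) throws IOException {
        List<File> failures = new ArrayList<>();
        collectFailures(dir, failures);
        if (!failures.isEmpty()) {
            // Single exception at the end, naming every survivor.
            throw new IOException("Unable to delete: " + failures);
        }
    }

    private static void collectFailures(File dir, List<File> failures) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or unreadable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collectFailures(child, failures); // keep going past locked files
            }
            if (!child.delete() && child.exists()) {
                failures.add(child); // remember it, but carry on with siblings
            }
        }
    }
}
```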

Reasoning...
Unlike Unix, Microsoft Windows does not allow a file to be deleted while another process has it open, which causes delete operations to fail.
Furthermore, most Windows installations run software that monitors the filesystem for activity and then inspects the contents of recently added/removed files (which means it locks them, albeit temporarily) - the Windows Search service and anti-virus software, to name but two (and Windows Vista and Windows 7 seem to have additional complications).

This means that builds which rely on cleaning a workspace before they start will sometimes fail (claiming that they couldn't delete everything because a file was locked), resulting in a build failing with the following output:

Started by an SCM change
Building remotely on jenkinsslave27 in workspace C:\hudsonSlave\workspace\MyProject
Purging workspace...
hudson.util.IOException2: remote file operation failed: C:\hudsonSlave\workspace\MyProject at hudson.remoting.Channel@6f0564d7:jenkinsslave27
at hudson.FilePath.act(FilePath.java:835)
at hudson.FilePath.act(FilePath.java:821)
at hudson.plugins.accurev.AccurevSCM.checkout(AccurevSCM.java:331)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:586)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:475)
at hudson.model.Run.run(Run.java:1434)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Unable to delete C:\hudsonSlave\workspace\MyProject\...\src\...\foo - files in dir: [C:\hudsonSlave\workspace\MyProject\...\src\...\foo\bar]
at hudson.Util.deleteFile(Util.java:236)
at hudson.Util.deleteRecursive(Util.java:287)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.Util.deleteRecursive(Util.java:278)
at hudson.Util.deleteContentsRecursive(Util.java:198)
at hudson.plugins.accurev.PurgeWorkspaceContents.invoke(PurgeWorkspaceContents.java:28)
at hudson.plugins.accurev.PurgeWorkspaceContents.invoke(PurgeWorkspaceContents.java:11)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2161)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:60)


What's needed is a retry mechanism, i.e. the equivalent of wrapping Ant's <delete file="foo"/> in <retry>, but with a (small) delay between attempts (and maybe a call to the garbage collector, in case the process holding the file open is the build slave process itself).
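Such a retry loop might look like this (a minimal sketch with illustrative names, including the optional GC call; not the actual hudson.Util implementation):

```java
import java.io.File;

/**
 * Minimal sketch of the retry mechanism suggested above.
 * Illustrative names; not the actual hudson.Util code.
 */
public class RetryingDelete {

    /** Tries to delete a file or empty directory, pausing between attempts. */
    public static boolean deleteWithRetries(File f, int maxAttempts, long waitMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (f.delete() || !f.exists()) {
                return true; // deleted, or already gone
            }
            if (attempt < maxAttempts) {
                // If our own JVM is the culprit, a GC can release handles that
                // are only held pending finalization (Ant's <delete> task uses
                // the same tactic on Windows).
                System.gc();
                // Give whatever transiently locked the file (indexer, AV
                // scanner) a chance to move on before the next attempt.
                Thread.sleep(waitMillis);
            }
        }
        return false;
    }
}
```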

#### Activity

pjdarton added a comment:

Note: This file locking behavior also causes non-Jenkins issues, e.g. deleting multiple folders using Windows explorer will sometimes leave one (usually empty) folder behind, and even a simple "RD /S /Q MyFolder" will sometimes fail to delete the folder on its first attempt. In these cases, simply retrying the operation will succeed. Personally, I think it's a Windows "feature".

As a workaround, I've wrapped most of my calls to Ant's <delete> task in <retry>, and this has eliminated this problem from any of my builds that manage to start BUT this doesn't help if Jenkins doesn't get as far as running my builds.
e.g. I'm using the accurev plugin for my SCM and it cleans the working directory before it grabs the source - I typically get about a 1% failure rate at this stage. Whilst 1% is not a blocking issue, it's not reliable, which is not what one wants from a build system.

Personally, I've found that excluding the build areas from Search & anti-virus helps reduce the problem, but it is insufficient to stop these failures completely (at least on Windows 7) - something, somewhere, will still lock files, sometimes, but any investigation (after the build has failed) shows that no process has the file "open".

pjdarton added a comment (edited):

Features:

• Added two new system properties that control behavior: "Util.deletionRetries" (an integer, defaults to 3) and "Util.deletionRetryWait" (an integer, defaults to 500ms).
• Delete operations that affect directories now try to delete the entire contents of the directory, continuing on to subfolders etc. even after encountering files that wouldn't die, before eventually throwing an exception listing what wouldn't die. i.e. if a folder has files "a", "b" and "c", and you can't delete "b", then "a" and "c" would get deleted (and you'll still get the exception about "b").
• Delete operations now have multiple attempts at deleting things, so if not everything could be deleted first time around, maybe they'll get deleted 2nd/3rd etc time around. An exception is only thrown if all retry attempts are exhausted and there are still files/directories that won't delete.
• Added some unit tests for these methods.
• After posting this back in October 2012, I built a version of Jenkins LTS with this patch applied. I've been using it at work for all our development stuff and I've not had file locking problems since. I'm pretty confident that it fixes the problem.
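
The two properties described above could be consumed along these lines (a sketch; the property names are taken from this comment and may not match what was eventually merged into Jenkins core):

```java
/**
 * Sketch of how the tuning properties described above might be read.
 * Property names come from the comment and may differ in the merged code.
 */
public class DeletionSettings {

    /** Number of delete attempts; override with -DUtil.deletionRetries=N. */
    public static int deletionRetries() {
        return Integer.getInteger("Util.deletionRetries", 3);
    }

    /** Pause between attempts, in ms; override with -DUtil.deletionRetryWait=N. */
    public static int deletionRetryWait() {
        return Integer.getInteger("Util.deletionRetryWait", 500);
    }
}
```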

Disclaimers:

• I've not tested this on Linux (nor the unit tests). It should be harmless (the behavioural changes are conditional on being on Windows), but it'd be worth running the unit tests on Linux just to verify that.
pjdarton added a comment:

JENKINS-15331 should fix JENKINS-10905.
pjdarton added a comment:

Uploaded git patch file; this was produced using the git command-line and isn't claiming to change the entire file. This will probably be a lot easier to merge.

This is my "new and improved" solution.
In addition to retrying the deletes, this also calls System.gc() if it's on Windows (a tactic also used by Apache Ant's Delete task to work around the same problem).

pjdarton added a comment:

Have re-done my GitHub pull request to reflect the new changes (and to fix the CRLF issue with the previous pull request).
New pull request is https://github.com/jenkinsci/jenkins/pull/615

pjdarton added a comment:

I've now been running the LTS Jenkins build (1.480.1) with this patch applied at work for a while.
I've not seen any builds failing due to "file in use" since.
I would therefore recommend that this patch / pull-request be incorporated into the main branch ASAP, and to the next LTS release.

Daniel Kirkdorffer added a comment:

I believe this is also the root cause of JENKINS-15852. The Git Plugin has a call in GitAPI to FilePath.deleteRecursive(), which in turn calls Util.deleteRecursive(). It is almost immediately trying to delete a workspace that has just been created. Additionally, we have encryption and McAfee software monitoring files that could be locking them.

pjdarton added a comment (edited):

File-locking is the bane of anyone running any kind of automated system on Windows, so I'd agree that this might well solve the problem (as long as you're sure that the Git code doesn't use the workspace as its current directory, as no amount of retrying will change that).

I also have anti-virus stuff running on my build slaves, and despite that I've not noticed any builds fail due to file-locking issues since I started running a custom build of Jenkins LTS that has this fix in it.
I think that this amounts to a fair amount of circumstantial evidence that this fix works.

Per Westling added a comment:

This is a very interesting patch, as we encounter a similar bug several times a week.

Will this be added to the Jenkins releases in the near future?

Brian Brooks added a comment (edited):

We are encountering a similar problem that I originally attributed to some kind of weird conflict between "Use private Maven repository" and "SCM / Subversion / Check-out Strategy / Always checkout a fresh copy".

Not sure why a Maven repo entry local to the workspace would be locked before the code is even checked out. Maven shouldn't even be running yet, and no process other than the Jenkins job that uses this workspace should be referencing a workspace-private Maven repo entry.

Environment:

• Jenkins 1.517
• Maven 3.4 (-Xmx1536m -XX:MaxPermSize=256m)
• Java 1.7.0_15-b03 Oracle JVM 64-bit
• Windows 2008 Server 64-bit
• Clean server with no virus scanner, indexing, etc.
• Dell PowerEdge 2950 (2 CPUs with 4 cores each, 8 cores total)
• PERC 5i Serial Attached SCSI controller
• Single C: partition formed from two physical drives in RAID 1

Build console output:
Started by timer
Building in workspace C:\Jenkins\jobs\Maxview-Daily-Build-6.2-WINDOWS-Trunk\workspace
Cleaning local Directory .
java.nio.file.FileSystemException: C:\Jenkins\jobs\Maxview-Daily-Build-6.2-WINDOWS-Trunk\workspace\.\.repository\ant\ant-antlr\1.6.5\ant-antlr-1.6.5.jar: The process cannot access the file because it is being used by another process.

at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsFileSystemProvider.implDelete(Unknown Source)
at sun.nio.fs.AbstractFileSystemProvider.delete(Unknown Source)
at java.nio.file.Files.delete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at hudson.Util.deleteFile(Util.java:237)
at hudson.Util.deleteRecursive(Util.java:305)
at hudson.Util.deleteContentsRecursive(Util.java:202)
at hudson.Util.deleteRecursive(Util.java:296)
at hudson.Util.deleteContentsRecursive(Util.java:202)
at hudson.Util.deleteRecursive(Util.java:296)
at hudson.Util.deleteContentsRecursive(Util.java:202)
at hudson.Util.deleteRecursive(Util.java:296)
at hudson.Util.deleteContentsRecursive(Util.java:202)
at hudson.Util.deleteRecursive(Util.java:296)
at hudson.Util.deleteContentsRecursive(Util.java:202)
at hudson.scm.subversion.CheckoutUpdater$1.perform(CheckoutUpdater.java:75)
at hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:153)
at hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:903)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:884)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:867)
at hudson.FilePath.act(FilePath.java:905)
at hudson.FilePath.act(FilePath.java:878)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:843)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1369)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:676)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:581)
at hudson.model.Run.execute(Run.java:1576)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:486)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:241)
pjdarton added a comment:

Yes, that's the kind of error you can get when doing any filesystem access on Windows (whether from Java or anything else). Basically, if you're on Windows, ANY file operation can fail at any point with a "file locked by another process" error, and you need to catch these and retry; if you retry after a small delay, whatever process was sabotaging your operation will usually have moved on.
It's also the kind of error that I kept getting that prompted me to create this patch, and I can state (with some confidence now) that this fixed it for me.

Note: under Java, the process sabotaging your file operation might well be your own - if you don't manually close file handles but just rely on the garbage collector to do so, attempts to delete those files will fail until the GC has run. This is why I run the GC as well, just in case (not sure if that was a deciding factor, but it's what Ant does and it worked for me).
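
The self-sabotage scenario can be avoided by closing handles deterministically rather than waiting for the GC. A minimal illustration (hypothetical names, not Jenkins code):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

/**
 * Illustration of deterministic handle release. An unclosed stream keeps
 * its file handle alive until finalization; try-with-resources closes it
 * immediately, so a later delete cannot be blocked by our own JVM.
 */
public class HandleHygiene {

    /** Reads the first byte of a file, closing the handle deterministically. */
    public static int readFirstByte(Path p) throws IOException {
        // try-with-resources guarantees close() runs even on exceptions,
        // which matters on Windows where open handles block deletion.
        try (InputStream in = new FileInputStream(p.toFile())) {
            return in.read();
        }
    }
}
```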

Dirk Heinrichs added a comment:

To make things even worse, the message "file locked by another process" doesn't necessarily mean that the file IS locked by another process. Notepad++, for example, prints this message even if the real error is "permission denied". Took me quite some time to find out...

pjdarton added a comment:

I've just attached a new patch file "0001-Proposed-solution-to-JENKINS-15331.patch"
This one is based on the current Jenkins master trunk (at the time of writing, that's aimed at 1.560-SNAPSHOT).

This is slightly different than the earlier patch:
1) The configuration for garbage-collection when deletes fail now defaults to "false" on all platforms.
2) The garbage-collection should now get called if it's enabled (the previous version had a bug).

pjdarton added a comment:

There's an open pull request that fixes this on github, https://github.com/jenkinsci/jenkins/pull/1209
That supersedes the patches etc. here.

Laurent Malvert added a comment:

We also see this issue (and have for a while), up until at least the current 1.5776 release.

As mentioned by others above, it's a common gripe with NTFS, and sadly your chances of hitting the issue increase considerably with large checkouts/workspaces (NTFS struggles to delete a large number of files efficiently).

It would be great if this patch could finally be merged into the head.

Daniel Beck added a comment:

The pull request can no longer be merged cleanly, as I commented six weeks ago. Also, a question by Oliver Gondza remains unanswered.

Laurent Malvert added a comment (edited):

Also, I'd like to recommend an alternative to the "delay and retry" strategy. While that works most of the time, it's not entirely foolproof, as you can only hope that NTFS will release the lock within the timeframe of your delays/retries.

Generally what serves me best on NTFS systems is to NOT delete large folders (at first), but instead to rename/move them to a different location (where they can be deleted by a batch job). And possibly to recreate the desired folder.

I actually do this for my maven local repository and most of my development checkouts on my development machine. I have a custom alias that moves things to a temp folder instead of deleting them, and a cron job that regularly deletes that folder. This way you have no lock on the folder you're currently working on.

Jenkins could very well use a similar approach by moving the data to be disposed of to the Windows temp folder, or to a trash folder of its own choosing, to be regularly emptied by an internal task. This approach has multiple advantages:

• solves the locking for sure,
• no garbage collection required,
• no artificial delay required,
• and actually the "delete" operation is now perceived to be considerably faster (as it doesn't really happen, and move operations are close to instantaneous on most file systems).

Of course it means that at a given time, a lengthy and possibly intensive deletion process will occur in the background, but depending on how you implement it this could be scheduled to be done during periods of inactivity, or according to a planned schedule, or only when running out of disk space, etc...

Just my 2 cents, but considering that it's not atypical for Jenkins to deal with large folders, it would seem like a good approach for a number of scenarios (new/clean workspaces, deleting build records, deleting jobs, etc.).
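
The "rename now, delete later" idea described above could be sketched like this (illustrative names; not Jenkins' implementation, and it assumes the trash directory is on the same volume as the workspace):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Sketch of the deferred-deletion strategy suggested above.
 * Illustrative names only; not Jenkins' actual code.
 */
public class DeferredDelete {

    /**
     * Moves a directory into a trash area under a unique name.
     * A rename on the same volume is near-instant and is not blocked by
     * the transient per-file locks that make recursive deletes fail;
     * the trash area can then be purged later by a background task.
     */
    public static Path moveToTrash(Path victim, Path trashDir) throws IOException {
        Files.createDirectories(trashDir);
        // Unique target name so repeated moves of same-named folders don't clash.
        Path target = trashDir.resolve(victim.getFileName() + "." + System.nanoTime());
        return Files.move(victim, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```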

Arpit Nagar added a comment (edited):

Any update on this?

Andrew Gray added a comment:

This is affecting the Allure Reports plugin as well. Is this any closer to a fix?

Arpit Nagar added a comment:

It is a blocker for us; have you found any solution for this?

tsondergaard added a comment:

The problem appears to be eliminated or at least significantly reduced by disabling the "Windows Search" indexing service. Also look out for anti-virus programs causing problems.

http://www.pcmag.com/slideshow_viewer/0,3253,l=251692&a=251692&po=4,00.asp

pjdarton added a comment:

"Been there, done that"
In my experience, disabling Windows Search and anti-virus merely reduces the problem, e.g. down from a 5% failure rate to a 0.5% failure rate.
On all my Windows build slaves, I've configured Windows Search to only index the start menu and then disabled the search service entirely, and I've configured the anti-virus to exclude the Jenkins build area from its scans and on-access checking - and I was still seeing builds fail every week due to transient file-locking problems.

After implementing the fix for this (https://github.com/jenkinsci/jenkins/pull/1209) and applying it to my local Jenkins server, I haven't seen a single build fail due to these transient file-locking problems.

Robert Barnard added a comment -

We're experiencing the same issue. After we shut down a job, a process continues running and keeps a file open. In our case it's the aopalliance jar.
pjdarton added a comment -

Update:
I split the code changes into a refactor of the unit-test code (to make it easier to test this) and the actual enhancement to the deletion code.
The refactor has already been incorporated into Jenkins' core code. The actual enhancement is in https://github.com/jenkinsci/jenkins/pull/1800, awaiting merge.
praveen kumar jogi added a comment - edited

I would like the job to fail if it is unable to complete.

We have a job that deploys a war file and starts the service on a Windows build machine. However, if any process holds open one of the files in the destination directory, those files cannot be deployed, yet the build still finishes as successful. I would like the build to fail rather than succeed when its log contains entries like the ones below. Is there any workaround?

Log:
02:26:08 c:\resin-3.1.12\webapps\ROOT\WEB-INF\lib\ridl-3.2.1.jar - The process cannot access the file because it is being used by another process.
02:26:08 c:\resin-3.1.12\webapps\ROOT\WEB-INF\lib\unoil-3.2.1.jar - The process cannot access the file because it is being used by another process.

Jenkins architecture:
Master 1.656 (Linux)
a couple of Windows build slaves
SCM/JIRA link daemon added a comment -

Code changed in jenkins
User: Peter Darton
Path:
core/src/main/java/hudson/Util.java
core/src/test/java/hudson/UtilTest.java
http://jenkins-ci.org/commit/jenkins/310c6747625a5e5605ac87c68d02eddaacdc8e0e
Log:
FIXED JENKINS-15331 by changing Util.deleteContentsRecursive, Util.deleteFile and Util.deleteRecursive so that they can retry failed deletions.
The number of deletion attempts and the time it waits between deletes are configurable via system properties (like hudson.Util.noSymlink etc).
Util.DELETION_MAX is set by -Dhudson.Util.deletionMax. Default is 3 attempts.
Util.WAIT_BETWEEN_DELETION_RETRIES is set by -Dhudson.Util.deletionRetryWait. Default is 100 milliseconds.
Util.GC_AFTER_FAILED_DELETE is set by -Dhudson.Util.performGCOnFailedDelete. Default is false.
Added unit-tests for new functionality.
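The retry behaviour described in the commit log above can be sketched roughly as follows. This is an illustrative simplification, not the actual Jenkins implementation: the constant and property names are taken from the commit log, but the class name, method body, and `main` demo are invented for this example.

```java
import java.io.File;
import java.io.IOException;

/**
 * Illustrative sketch of the retry-on-failed-delete strategy described
 * in the commit log; not the real hudson.Util code.
 */
public class RetryingDelete {
    // Read from the documented system properties, falling back to the defaults.
    static final int DELETION_MAX =
            Integer.getInteger("hudson.Util.deletionMax", 3);
    static final int WAIT_BETWEEN_DELETION_RETRIES =
            Integer.getInteger("hudson.Util.deletionRetryWait", 100);
    static final boolean GC_AFTER_FAILED_DELETE =
            Boolean.getBoolean("hudson.Util.performGCOnFailedDelete");

    /** Attempts to delete a file, retrying if it is transiently locked. */
    public static void deleteFile(File f) throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= DELETION_MAX; attempt++) {
            if (f.delete() || !f.exists()) {
                return; // deleted, or already gone
            }
            if (attempt < DELETION_MAX) {
                if (GC_AFTER_FAILED_DELETE) {
                    // A GC can release handles held by unclosed streams.
                    System.gc();
                }
                Thread.sleep(WAIT_BETWEEN_DELETION_RETRIES);
            }
        }
        throw new IOException("Unable to delete " + f
                + " after " + DELETION_MAX + " attempts");
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("retry-delete", ".tmp");
        deleteFile(tmp);
        System.out.println("deleted=" + !tmp.exists());
    }
}
```

The key point is that a transient lock (an anti-virus or indexing scan) usually clears within a few hundred milliseconds, so a short wait-and-retry loop turns most spurious failures into successes while still reporting files that stay locked.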
SCM/JIRA link daemon added a comment -

Code changed in jenkins
User: Daniel Beck
Path:
core/src/main/java/hudson/Util.java
core/src/test/java/hudson/UtilTest.java
http://jenkins-ci.org/commit/jenkins/240405dfc33e9c9a96a159b36be269b3201567fa
Log:
Merge pull request #2026 from pjdarton/fix_jenkins_15331

[FIX JENKINS-15331] Windows file locking workaround

Compare: https://github.com/jenkinsci/jenkins/compare/49a65c2bbbd8...240405dfc33e
pjdarton added a comment -

Code changes are in Jenkins 2.2 onwards.
Parameters that control this functionality have been documented on https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties
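These parameters are ordinary JVM system properties, so they can be passed on the Jenkins startup command line. A possible invocation (assuming a standard `jenkins.war` deployment; the chosen values here are examples, not recommendations):

```shell
# Allow up to 5 deletion attempts, wait 250 ms between them,
# and request a GC after each failed delete.
java -Dhudson.Util.deletionMax=5 \
     -Dhudson.Util.deletionRetryWait=250 \
     -Dhudson.Util.performGCOnFailedDelete=true \
     -jar jenkins.war
```

Note that deletions performed on build slaves run in the slave JVM, so the properties may need to be set on the slave launch command as well as on the master.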
Jörg Ziegler added a comment -

Are there any plans to backport this to LTS/1.651? The issue persists on Windows Server 2012R2 running 1.651.2.
Daniel Beck added a comment -

This was not considered for backporting because this issue is an Improvement and not a Bug.

Now it's too late; the 1.651.3 RC is out.
pjdarton added a comment -

The only reason this was logged as an "improvement" is that the fault really lies within the Windows OS / JRE and not within Jenkins itself, but all the symptoms (the issues that link to this) are bugs from an end-user's point of view: Jenkins builds "fail at random" on Windows (which is a bug), and this "improvement" is the cure.
i.e. for anyone trying to do builds on Windows, this is a bugfix (as evidenced by all the issues that link to this).

So, sure, this is an "improvement" (Jenkins now works reliably on Windows, and that's a huge improvement), but the reason I coded this was to fix a whole load of unreliability (aka "bugs") seen on Windows.

This was flagged as an lts-candidate, so I was rather hoping that it'd be backported to the LTS release.
As it stands now, either all Windows users have to upgrade to Jenkins 2, or they have to build their own LTS version (as I had to) ... or it gets included in the next LTS. You can probably guess which option I'm in favour of.
Jörg Ziegler added a comment -

Thanks pjdarton - this bug is pretty much killing our productivity, as it requires manually restarting slaves every few hours. I strongly agree that it's more than an improvement.
Daniel Beck added a comment -

pjdarton Not my fault – Oliver Gondža filters by issue type and resolution, and anything that's not a fixed bug doesn't qualify, label or not.

This could have been corrected before the RC was published; by now it's too late for .3.
Jörg Ziegler added a comment -

Daniel Beck thanks for the quick replies. Is there any field that would need updating in this issue so that it will be included in a .4?
Oleg Nenashev added a comment -

Actually, we can still merge it into .3 if Oliver Gondža agrees, but I'm not so happy about that since the RC is under testing now.
Regarding .4: it is unlikely to happen under the current release model; it would need a wide discussion on the developer list.

BR, Oleg
Daniel Beck added a comment -

We don't do .4's, except when we mess up so badly there's no way around it, but this doesn't qualify.
Oliver Gondža added a comment -

I decided not to squeeze this into .3 (the last in its line) for stability's sake. We need to be extra careful, as unfortunately we do not do much testing on Windows.
Oliver Gondža added a comment -

Consumed by the 2.7.x line, so no need to backport.

#### People

• Assignee:
Unassigned
• Reporter:
pjdarton