Details

    • Similar Issues:

      Description

      The implementation of AtomicFileWriter is not actually atomic.

      There are several issues with it.

      1. the rename first deletes the target giving a window of opportunity for no file to exist.
      2. the rename of the file will rename the file but the data in the file may not have been flushed to disk. In XFS this is a biggy as the metadata and filedata are written separatly - so despite the file being closed there is not fsync call - so if the system crashes you end up with zero length files.
        This has been observed in the wild when the OS hosting jenkins crashed, on restart several of Jenkins files were zero length.

      Jenkins should make use of java.nio to make sure

      1. data is actually synced
      2. it uses atomic operations for move on platforms where it is available.

        Attachments

          Issue Links

            Activity

            teilo James Nord created issue -
            rtyler R. Tyler Croy made changes -
            Field Original Value New Value
            Workflow JNJira [ 171052 ] JNJira + In-Review [ 184143 ]
            jglick Jesse Glick made changes -
            Link This issue relates to JENKINS-36088 [ JENKINS-36088 ]
            jglick Jesse Glick made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jglick Jesse Glick made changes -
            Assignee Steven Christou [ schristou ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "PR 2548 (Web Link)" [ 14872 ]
            jglick Jesse Glick made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            Hide
            michaelneale Michael Neale added a comment -

            yes this would be very good. Good catch James Nord - I believe this has been the underlying problem of a lot of things (I might have to take back the bad things I said about xfs, as this seems more likely to be the problem of a block consistent snapshot without xfs_freeze).

            Very nice!

            Show
            michaelneale Michael Neale added a comment - yes this would be very good. Good catch James Nord - I believe this has been the underlying problem of a lot of things (I might have to take back the bad things I said about xfs, as this seems more likely to be the problem of a block consistent snapshot without xfs_freeze). Very nice!
            teilo James Nord made changes -
            Link This issue relates to JENKINS-47324 [ JENKINS-47324 ]
            teilo James Nord made changes -
            Status In Review [ 10005 ] In Progress [ 3 ]
            teilo James Nord made changes -
            Status In Progress [ 3 ] Open [ 1 ]
            teilo James Nord made changes -
            Assignee Steven Christou [ schristou ] James Nord [ teilo ]
            teilo James Nord made changes -
            Assignee James Nord [ teilo ]
            Hide
            jglick Jesse Glick added a comment -

            I have also noticed that Jenkins.getInstance().createProject(Whatever.class, "aux") will fail on Windows in AtomicFileWriter.<init> with Failed to create a temporary file in …\jobs\aux, which is not very meaningful, because an earlier File.mkdirs returned false but this was ignored. If you switch to Files.createDirectories you at least get a java.nio.file.FileSystemException: …\jobs\aux: The directory name is invalid. which is also a bit cryptic but at least points to the issue: the aux part.

            Show
            jglick Jesse Glick added a comment - I have also noticed that Jenkins.getInstance().createProject(Whatever.class, "aux") will fail on Windows in AtomicFileWriter.<init> with Failed to create a temporary file in …\jobs\aux , which is not very meaningful, because an earlier File.mkdirs returned false but this was ignored. If you switch to Files.createDirectories you at least get a java.nio.file.FileSystemException: …\jobs\aux: The directory name is invalid. which is also a bit cryptic but at least points to the issue: the aux part.
            teilo James Nord made changes -
            Labels data
            teilo James Nord made changes -
            Labels data
            teilo James Nord made changes -
            Labels dataloss
            batmat Baptiste Mathus made changes -
            Assignee Baptiste Mathus [ batmat ]
            batmat Baptiste Mathus made changes -
            Description The implementation of AtomicFile is not actually atomic.

            There are several issues with it.

            1) the rename first deletes the target giving a window of opertunity for no file to exist.

            2) the rename of the file will rename the file but the data in the file may not have been flushed to disk. In XFS this is a biggy as the metadata and filedata are written separatly - so despite the file being closed there is not fsync call - so if the system crashes you end up with zero length files.

            This has been observed in the wild when the OS hosting jenkins crashed, on restart several of Jenkins files where zero length.

            Jenkins should make use of java.nio to make sure
            1) data is actually synced
            2) it uses atomic operations for move on platforms where it is available.
            The implementation of {{AtomicFile}} is not actually atomic.

            There are several issues with it.
             # the rename first deletes the target giving a window of opportunity for no file to exist.
             # the rename of the file will rename the file but the data in the file may not have been flushed to disk. In XFS this is a biggy as the metadata and filedata are written separatly - so despite the file being closed there is not fsync call - so if the system crashes you end up with zero length files.
            This has been observed in the wild when the OS hosting jenkins crashed, on restart several of Jenkins files were zero length.

            Jenkins should make use of {{java.nio}} to make sure
             # data is actually synced
             # it uses atomic operations for move on platforms where it is available.
            batmat Baptiste Mathus made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            batmat Baptiste Mathus made changes -
            Description The implementation of {{AtomicFile}} is not actually atomic.

            There are several issues with it.
             # the rename first deletes the target giving a window of opportunity for no file to exist.
             # the rename of the file will rename the file but the data in the file may not have been flushed to disk. In XFS this is a biggy as the metadata and filedata are written separatly - so despite the file being closed there is not fsync call - so if the system crashes you end up with zero length files.
            This has been observed in the wild when the OS hosting jenkins crashed, on restart several of Jenkins files were zero length.

            Jenkins should make use of {{java.nio}} to make sure
             # data is actually synced
             # it uses atomic operations for move on platforms where it is available.
            The implementation of {{AtomicFileWriter}} is not actually atomic.

            There are several issues with it.
             # the rename first deletes the target giving a window of opportunity for no file to exist.
             # the rename of the file will rename the file but the data in the file may not have been flushed to disk. In XFS this is a biggy as the metadata and filedata are written separatly - so despite the file being closed there is not fsync call - so if the system crashes you end up with zero length files.
             This has been observed in the wild when the OS hosting jenkins crashed, on restart several of Jenkins files were zero length.

            Jenkins should make use of {{java.nio}} to make sure
             # data is actually synced
             # it uses atomic operations for move on platforms where it is available.
            batmat Baptiste Mathus made changes -
            Summary AtomicFile isn't Atomic AtomicFileWrite isn't Atomic
            batmat Baptiste Mathus made changes -
            Summary AtomicFileWrite isn't Atomic AtomicFileWriter isn't Atomic
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: christ66
            Path:
            core/src/main/java/hudson/util/AtomicFileWriter.java
            http://jenkins-ci.org/commit/jenkins/0a7b656af6ac5c70a408abb082f95ecd51ee8bd0
            Log:
            JENKINS-34855 Make atomic file write more atomic by using JDK 7 apis.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: christ66 Path: core/src/main/java/hudson/util/AtomicFileWriter.java http://jenkins-ci.org/commit/jenkins/0a7b656af6ac5c70a408abb082f95ecd51ee8bd0 Log: JENKINS-34855 Make atomic file write more atomic by using JDK 7 apis.
            jamesdumay James Dumay made changes -
            Remote Link This issue links to "CloudBees Internal OSS-2561 (Web Link)" [ 18249 ]
            batmat Baptiste Mathus made changes -
            Remote Link Cette demande est liée à "PR-3170 (Lien Web)" [ 18304 ]
            batmat Baptiste Mathus made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            Show
            batmat Baptiste Mathus added a comment - https://github.com/jenkinsci/jenkins/pull/3170  in review
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            The fix has been integrated towards 2.93.

            Baptiste Mathus Steven Christou please assign the label if you want the fix to be backported to 2.89.2/3. I am not 100% sure it is safe to backport it without lots of soak testing

            Show
            oleg_nenashev Oleg Nenashev added a comment - The fix has been integrated towards 2.93. Baptiste Mathus Steven Christou please assign the label if you want the fix to be backported to 2.89.2/3. I am not 100% sure it is safe to backport it without lots of soak testing
            oleg_nenashev Oleg Nenashev made changes -
            Status In Review [ 10005 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            oleg_nenashev Oleg Nenashev made changes -
            Labels dataloss dataloss lts-candidate
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            Actually, added the label. But the discussion is required

            Show
            oleg_nenashev Oleg Nenashev added a comment - Actually, added the label. But the discussion is required
            batmat Baptiste Mathus made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            Hide
            batmat Baptiste Mathus added a comment -

            Oleg Nenashev the issue is only partly fixed.  And probably only improved should we say for now. We agreed to merge the first shot so that we could split the risk and indeed let it soak separately to possibly better understand and narrow down potential issues later on.

            https://github.com/jenkinsci/jenkins/pull/3170 contains the final fix.

            That is why I had kept this JIRA open for now. 

            Show
            batmat Baptiste Mathus added a comment - Oleg Nenashev the issue is only partly fixed.  And probably only improved should we say for now. We agreed to merge the first shot so that we could split the risk and indeed let it soak separately to possibly better understand and narrow down potential issues later on. https://github.com/jenkinsci/jenkins/pull/3170  contains the final fix. That is why I had kept this JIRA open for now. 
            batmat Baptiste Mathus made changes -
            Status Reopened [ 4 ] Open [ 1 ]
            batmat Baptiste Mathus made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            batmat Baptiste Mathus made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            Hide
            batmat Baptiste Mathus added a comment -

            https://github.com/jenkinsci/jenkins/pull/3170 now has a successful build and is ready for review. So, yes, ideally it would be great once it's reviewed and merged if we can have it in the .2, or maybe .3 for whatever reason.

            With .2, with the christmas release break (.2 ETA January 17th), it would make it 4+ weeks of soak testing. So iff we don't see bad feedback or alerts, I think it would be worth it to get it backported indeed. It's a pretty nasty issue, when it shows up. 

            Show
            batmat Baptiste Mathus added a comment - https://github.com/jenkinsci/jenkins/pull/3170  now has a successful build and is ready for review. So, yes, ideally it would be great once it's reviewed and merged if we can have it in the .2, or maybe .3 for whatever reason. With .2, with the christmas release break (.2 ETA January 17th), it would make it 4+ weeks of soak testing. So iff we don't see bad feedback or alerts, I think it would be worth it to get it backported indeed. It's a pretty nasty issue, when it shows up. 
            batmat Baptiste Mathus made changes -
            Comment [ [https://github.com/jenkinsci/jenkins/pull/3170] now has a successful build and is ready for review. So, yes, ideally it would be great once it's reviewed and merged if we can have it in the .2, or maybe .3 for whatever reason.

            With .2, with the christmas release break (.2 ETA January 17th), it would make it 4+ weeks of soak testing. So iff we don't see bad feedback or alerts, I think it would be worth it to get it backported indeed. It's a pretty nasty issue, when it shows up.  ]
            dnusbaum Devin Nusbaum made changes -
            Link This issue relates to JENKINS-48407 [ JENKINS-48407 ]
            Hide
            batmat Baptiste Mathus added a comment -

            FYI an opt-out flag has been added to the ongoing PR: https://github.com/jenkinsci/jenkins/pull/3170 as per Sam Van Oort good points. Reviews welcome.

            Show
            batmat Baptiste Mathus added a comment - FYI an opt-out flag has been added to the ongoing PR: https://github.com/jenkinsci/jenkins/pull/3170  as per Sam Van Oort good points. Reviews welcome.
            Hide
            batmat Baptiste Mathus added a comment -

            Once merged & released, this should probably be picked up in next LTS only once/if JENKINS-48407 is fixed too.

            Show
            batmat Baptiste Mathus added a comment - Once merged & released, this should probably be picked up in next LTS only once/if  JENKINS-48407 is fixed too.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Baptiste Mathus
            Path:
            core/src/main/java/hudson/util/FileChannelWriter.java
            core/src/test/java/hudson/util/FileChannelWriterTest.java
            http://jenkins-ci.org/commit/jenkins/22efbad9450fc95748e0bc3bee04ec49702d7ed2
            Log:
            JENKINS-34855 Create a FileChannelWriter

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Baptiste Mathus Path: core/src/main/java/hudson/util/FileChannelWriter.java core/src/test/java/hudson/util/FileChannelWriterTest.java http://jenkins-ci.org/commit/jenkins/22efbad9450fc95748e0bc3bee04ec49702d7ed2 Log: JENKINS-34855 Create a FileChannelWriter
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Baptiste Mathus
            Path:
            core/src/main/java/hudson/util/AtomicFileWriter.java
            core/src/main/java/hudson/util/FileChannelWriter.java
            core/src/test/java/hudson/util/FileChannelWriterTest.java
            test/src/test/java/hudson/util/AtomicFileWriterPerfTest.java
            http://jenkins-ci.org/commit/jenkins/438e9296f0646d920e19e05c7696d8cec5307a88
            Log:
            Merge pull request #3170 from batmat/JENKINS-34855

            [FIX JENKINS-34855] Create a FileChannelWriter and use it for AtomicFileWriter "enforced" flushing

            Compare: https://github.com/jenkinsci/jenkins/compare/02d0531add35...438e9296f064

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Baptiste Mathus Path: core/src/main/java/hudson/util/AtomicFileWriter.java core/src/main/java/hudson/util/FileChannelWriter.java core/src/test/java/hudson/util/FileChannelWriterTest.java test/src/test/java/hudson/util/AtomicFileWriterPerfTest.java http://jenkins-ci.org/commit/jenkins/438e9296f0646d920e19e05c7696d8cec5307a88 Log: Merge pull request #3170 from batmat/ JENKINS-34855 [FIX JENKINS-34855] Create a FileChannelWriter and use it for AtomicFileWriter "enforced" flushing Compare: https://github.com/jenkinsci/jenkins/compare/02d0531add35...438e9296f064
            batmat Baptiste Mathus made changes -
            Status In Review [ 10005 ] In Progress [ 3 ]
            Hide
            batmat Baptiste Mathus added a comment -

            Fixed towards 2.102, expected to be released in the next ~12 hours.

            Show
            batmat Baptiste Mathus added a comment - Fixed towards 2.102, expected to be released in the next ~12 hours.
            batmat Baptiste Mathus made changes -
            Status In Progress [ 3 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            Hide
            batmat Baptiste Mathus added a comment -

            Reminder: we must pick up JENKINS-48407 with this one if accepted for next LTS.

            Show
            batmat Baptiste Mathus added a comment - Reminder: we must pick up  JENKINS-48407 with this one if accepted for next LTS.
            olivergondza Oliver Gondža made changes -
            Labels dataloss lts-candidate 2.89.4-rejected dataloss lts-candidate
            danielbeck Daniel Beck made changes -
            Labels 2.89.4-rejected dataloss lts-candidate 2.89.4-rejected dataloss
            jamesdumay James Dumay made changes -
            Remote Link This issue links to "CloudBees Internal TIGER-2833 (Web Link)" [ 20538 ]

              People

              • Assignee:
                batmat Baptiste Mathus
                Reporter:
                teilo James Nord
              • Votes:
                1 Vote for this issue
                Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: