  1. Jenkins
  2. JENKINS-53401

Random FileNotFoundException when creating lots of agents in parallel threads

      Description

      When creating lots of agents in parallel (cloud provisioning of containers), I sometimes see random exceptions reported while temporary files are being moved to the node's config.xml.

      Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp -> /var/jenkins_home/nodes/myagent-5pr7b/config.xml
      		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
      		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      		at java.nio.file.Files.move(Files.java:1395)
      		at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
      java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
      	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      	at java.nio.file.Files.move(Files.java:1395)
      	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
      	at hudson.XmlFile.write(XmlFile.java:198)
      	at jenkins.model.Nodes.save(Nodes.java:289)
      	at hudson.util.PersistedList.onModified(PersistedList.java:173)
      	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
      	at hudson.model.Slave.<init>(Slave.java:198)
      	at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave.<init>(KubernetesSlave.java:116)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$Builder.build(KubernetesSlave.java:408)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:122)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:35)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      I traced the root cause to the nodeProperties field in hudson.model.Slave.

      If many agents are created in different threads, each thread ends up calling Jenkins.get().getNodesObject().save(). This method is not thread-safe, and it rewrites the storage for all nodes. As a result, save() throws an exception in some threads because the node has already been processed by another thread.

      In JENKINS-31055, Stephen made Node implement Saveable, which means the persisted lists should be tied to the node instead of the Nodes object. The corresponding save() operation is fine-grained, so the issue would be avoided completely.
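
      To make the idea of a fine-grained save concrete, here is a minimal sketch. It is not the Jenkins implementation: the class PerNodeSaveSketch and its saveNode method are hypothetical names, and the write-to-temp-then-move step only imitates what AtomicFileWriter appears to do in the trace above. The point is that each call touches a single node directory and is serialized per node name, so parallel creation of different agents never competes for the same files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical illustration of a per-node save; not Jenkins code. */
public class PerNodeSaveSketch {
    private final Path nodesDir;
    // One lock object per node name, so saves of different nodes do not block each other.
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    public PerNodeSaveSketch(Path nodesDir) {
        this.nodesDir = nodesDir;
    }

    /** Writes only nodesDir/name/config.xml; no other node's storage is touched. */
    public void saveNode(String name, String xml) throws IOException {
        Object lock = locks.computeIfAbsent(name, k -> new Object());
        synchronized (lock) {
            Path dir = Files.createDirectories(nodesDir.resolve(name));
            // Write to a uniquely named temporary file in the same directory first.
            Path tmp = Files.createTempFile(dir, "atomic", "tmp");
            try {
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
                // Then move it over the final file, replacing any previous version.
                Files.move(tmp, dir.resolve("config.xml"), StandardCopyOption.REPLACE_EXISTING);
            } finally {
                // No-op after a successful move; cleans up if the write or move failed.
                Files.deleteIfExists(tmp);
            }
        }
    }
}
```

      With this shape, a provisioning thread would call something like new PerNodeSaveSketch(nodesDir).saveNode("myagent-5pr7b", xml) for its own agent only, instead of triggering a save of every node.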

            Activity

            nuzz Matt Nuzzaco added a comment -

            This sounds very similar to what I was seeing in a few heavily parallelized jobs. We can easily kick off 200-500 agents in a very short period of time. I've tested v2.143 and so far I haven't seen the failure we were seeing before. Crossing fingers this was the solution. Thanks for the patch.

            danielbeck Daniel Beck added a comment -

            Addressed in 2.143.

            gregcovertsmith Greg Smith added a comment - - edited

            Please forgive me if out-of-line:

            There are reports of deadlock issues with EC2 slaves after upgrading to Jenkins LTS 2.138.2, and one of the changes between LTS 2.138.1 and 2.138.2 was this change. The issue is reported here: JENKINS-54187

            I don't know the code well enough to say for sure, but this change mentions slaves and thread-safety, and that bug is about slave creation and a deadlock. Knowing nothing more than that, and trying to figure out which change caused the deadlock, I thought they might be related.

            vlatombe Vincent Latombe added a comment -

            Greg Smith Indeed, it looks like they are related. For other readers: the new save path adds a Queue lock (https://github.com/jenkinsci/jenkins/blob/9557da32a3550bd98acc9d04728547fcd98b8a15/core/src/main/java/jenkins/model/Nodes.java#L193-L202), which wasn't in the previous save path (https://github.com/jenkinsci/jenkins/blob/9557da32a3550bd98acc9d04728547fcd98b8a15/core/src/main/java/jenkins/model/Nodes.java#L277-L300).

              People

              • Assignee:
                vlatombe Vincent Latombe
                Reporter:
                vlatombe Vincent Latombe