Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-51203

Slave.setNodeProperties is unreliable and inefficient

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core
    • Environment:
      Code that's creating multiple Slave nodes, in parallel, that have nodeProperties.
      e.g. cloud plugins or pipelines running parallel jobs in dynamic slaves.
    • Similar Issues:

      Description

      TL;DR:  Slave.setNodeProperties is unreliable because it's doing stuff it shouldn't.

       

      When a cloud plugin's Cloud.provision method gets called, the plugin code has to create one or more instances of hudson.model.Slave and return them to Jenkins.
      Prior to returning them to Jenkins, it has to populate those instances with all the data required.  This can (e.g. in the case of the docker-plugin) include a call to Slave.setNodeProperties.

      Slave.setNodeProperties calls nodeProperties.replaceBy()nodeProperties are a hudson.util.PersistedList whose owner is Jenkins.getInstance().getNodesObject() (see Slave.java line 150), and PersistedList.replaceBy calls PersistedList.onModified() which then calls Nodes.save() which writes all nodes to disk (which is a non-trivial operation that isn't thread-safe).

      i.e. At present, any change to the node properties for any slave (even those not persisted to disk yet) causes all slaves to get (re-)persisted to disk.

      This is very inefficient when there are a lot of slaves - it should, at most, persist just the slave in question, and in the case where a slave doesn't belong to anyone yet, it shouldn't get persisted at all.
      Worse still, when there's more than one slave being provisioned at a time (e.g. on a busy Jenkins server with a fast cloud), the persistence functionality can throw an exception and cause the entire operation to fail:

      java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
      	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      	at java.nio.file.Files.move(Files.java:1395)
      	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
      	at hudson.XmlFile.write(XmlFile.java:198)
      	at jenkins.model.Nodes.save(Nodes.java:274)
      	at hudson.util.PersistedList.onModified(PersistedList.java:173)
      	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
      	at hudson.model.Slave.setNodeProperties(Slave.java:299)
      	at ...
      

       

      What should happen instead is that changes to Slave.nodeProperties should only cause those changes to be persisted if the slave itself is persisted.  Changes to Slaves that are not (yet) part of Jenkins' list of Nodes should not cause a re-save of the configuration to disk.
      e.g. much like the logic done in Node.save().

      FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from hudson.util.AtomicFileWriter (see docker-plugin issue 644). While that plugin code is being enhanced to cope with such failures, the core code shouldn't be causing these failures in the first place.

       

      Suggestion:  I suspect that this could be fixed by having Slave have nodeProperties.owner set to this, because then it'll benefit from the same optimization logic that's already implemented in Node.save().

      Note: Elsewhere, calls to Nodes.save() are wrapped in a Queue.withLock block to prevent concurrent updates; the call from Slave.nodeProperties isn't wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it's existing functionality thread-safe.

        Attachments

          Issue Links

            Activity

            Hide
            integer Kanstantsin Shautsou added a comment - - edited

            One more possible chain how to get into this error

            04:25:05 Caused by: java.nio.file.NoSuchFileException: /jenkins/jenkins-work/nodes/yadp877105322/atomic113511861761462995tmp
            04:25:05 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
            04:25:05 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
            04:25:05 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
            04:25:05 	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
            04:25:05 	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
            04:25:05 	at java.nio.file.Files.move(Files.java:1395)
            04:25:05 	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
            04:25:05 	at hudson.XmlFile.write(XmlFile.java:198)
            04:25:05 	at jenkins.model.Nodes.save(Nodes.java:274)
            04:25:05 	at hudson.util.PersistedList.onModified(PersistedList.java:173)
            04:25:05 	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
            04:25:05 	at hudson.model.Slave.<init>(Slave.java:198)
            04:25:05 	at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
            
            Show
            integer Kanstantsin Shautsou added a comment - - edited One more possible chain how to get into this error 04:25:05 Caused by: java.nio.file.NoSuchFileException: /jenkins/jenkins-work/nodes/yadp877105322/atomic113511861761462995tmp 04:25:05 at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) 04:25:05 at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) 04:25:05 at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) 04:25:05 at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409) 04:25:05 at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) 04:25:05 at java.nio.file.Files.move(Files.java:1395) 04:25:05 at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206) 04:25:05 at hudson.XmlFile.write(XmlFile.java:198) 04:25:05 at jenkins.model.Nodes.save(Nodes.java:274) 04:25:05 at hudson.util.PersistedList.onModified(PersistedList.java:173) 04:25:05 at hudson.util.PersistedList.replaceBy(PersistedList.java:85) 04:25:05 at hudson.model.Slave.<init>(Slave.java:198) 04:25:05 at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)

              People

              • Assignee:
                Unassigned
                Reporter:
                pjdarton pjdarton
              • Votes:
                2 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: