Jenkins / JENKINS-47821

vsphere plugin 2.16 not respecting slave disconnect settings

Description

Starting in vSphere Plugin 2.16, the behaviour at the end of a job is broken.

I configure the node to disconnect after 1 build, and to shut down at that point. This, along with snapping back to the snapshot upon startup, gives me a guaranteed-clean machine at the start of every build.

Starting in version 2.16, the plugin seems to opportunistically ignore the "disconnect after (1) builds" setting, and re-uses the node to run the next queued job without enforcing a snap back to the snapshot. This next build then has high odds of failing or mis-building, as the node is unclean.

WORKAROUND: Revert to plugin version 2.15, where the error does not occur.
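
For diagnosis, a script-console sketch (the agent name "my-vsphere-agent" is illustrative) to confirm which retention settings are actually attached to the node:

	import jenkins.model.Jenkins
	
	// Print the agent's implementation class and retention strategy; the
	// vSphere plugin's own "limited builds" counter may be stored
	// elsewhere on the node object (assumption: verify on your install).
	def agent = Jenkins.instance.getNode('my-vsphere-agent')
	println agent?.getClass()
	println agent?.retentionStrategy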

Activity

Valentin Marin added a comment (edited)

Slaves are connected via JNLP (Windows service, passing the JNLP secret), remoting version 3.17.

Josiah Eubank added a comment (edited)

Found a ticket regarding build history and pipelines: JENKINS-38877.

Still experiencing this on 2.18, even though the text "Limited Builds is not currently used" no longer appears in the config help. Note this is combined with "Take this agent offline when not in demand...."

Oren Chapo added a comment

I've also seen this issue with versions 2.16 and 2.18 of the vSphere Cloud plugin; however, it seems it's not a problem in the plugin itself, but a limitation of the "cloud" Jenkins interface that the plugin implements.

If you're trying to ensure a slave is always in a "clean" state when allocated, here's my workaround, arrived at after hours of painful Google searching, trial and error:
1. Node configuration: fill the "Snapshot Name" field (e.g. "Clean").
2. Node configuration: Availability: "Take this agent online when in demand, and offline when idle".
3. Node configuration: What to do when the slave is disconnected: "Shutdown".
4. Pipeline job configuration: include the following code:

	import jenkins.slaves.*
	import jenkins.model.*
	import hudson.slaves.*
	import hudson.model.*
	
	// Shuts the current node down and disconnects it, so the vSphere
	// plugin reverts it to the "clean" snapshot before its next start.
	def SafelyDisposeNode() {
		print "Safely disposing node..."
		def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
		if (slave == null) {
			error "ERROR: Could not get slave object for node!"
		}
		try {
			// Mark the node temporarily offline first, so no other build
			// can be scheduled on it while it is shutting down.
			slave.getComputer().setTemporarilyOffline(true, null)
			// Trigger an OS shutdown with a short delay, so this step
			// returns before the machine actually goes down.
			if (isUnix()) {
				sh "(sleep 2; poweroff)&"
			} else {
				bat "shutdown -t 2 -s"
			}
			slave.getComputer().disconnect(null)
			sleep 10
		} catch (err) {
			print "ERROR: could not safely dispose node!"
		} finally {
			// Clear the temporarily-offline flag so the node can be
			// brought back on demand.
			slave.getComputer().setTemporarilyOffline(false, null)
		}
		print "...node safely disposed."
		slave = null
	}
	
	// Drop-in replacement for a node() block that guarantees the node is
	// disposed at the end, even when the body throws.
	def DisposableNode(String nodeLabel, Closure body) {
		node(nodeLabel) {
			try {
				body()
			} finally {
				SafelyDisposeNode()
			}
		}
	}
          
          

5. When you want to ensure the node will NOT be used by another job (or another run of the same job), use a "DisposableNode" block instead of a "node" block:

	DisposableNode('MyNodeLabel') {
		// Run your pipeline code here.
		// The node is shut down at the end of this block, even on failure:
		// no other job or build can use the node in its "dirty" state, and
		// the vSphere plugin will revert to the "clean" snapshot before
		// starting the node again.
	}
          

6. If other jobs are using this node (or node label), they must all use the above workaround, to avoid leaving a "dirty" machine for each other (see the shared-library sketch at the end of this comment).
7. As for the "why is it so important to have the node in a clean state?" question: my use case is integration tests of kernel-mode drivers (both Windows and Linux) that typically "break" the OS and leave it in an unstable state (BSODs and kernel panics are common).
8. If your pipeline job runs under a Groovy sandbox, you will need to permit some method signatures: the job will fail and offer to whitelist them; repeat carefully several times (see the script-console sketch below).
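
A script-console sketch for point 8, to pre-approve the signatures this workaround is likely to trip over. The exact list is an assumption; approve whatever your instance actually reports as pending.

	import org.jenkinsci.plugins.scriptsecurity.scripts.ScriptApproval
	
	// Likely signatures for the workaround above (assumption: verify
	// against what your Jenkins actually prompts for).
	def approval = ScriptApproval.get()
	[
		'staticMethod jenkins.model.Jenkins getInstance',
		'method jenkins.model.Jenkins getNode java.lang.String',
		'method hudson.model.Slave getComputer',
		'method hudson.model.Computer setTemporarilyOffline boolean hudson.slaves.OfflineCause',
		'method hudson.model.Computer disconnect hudson.slaves.OfflineCause'
	].each { approval.approveSignature(it) }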

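Regarding point 6: rather than copy-pasting these functions into every Jenkinsfile, they can live in a Jenkins shared library as custom steps. A minimal sketch, assuming a globally configured library; the library and step names here are illustrative.

	// vars/disposableNode.groovy in the shared library (SafelyDisposeNode
	// above would move to vars/safelyDisposeNode.groovy alongside it).
	def call(String nodeLabel, Closure body) {
		node(nodeLabel) {
			try {
				body()
			} finally {
				safelyDisposeNode()
			}
		}
	}

Each job then only needs:

	@Library('disposable-node') _
	disposableNode('MyNodeLabel') {
		// pipeline code here
	}
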
James Telfer added a comment

Any progress on this? I have just come up against what looks like the same issue: statically defined Windows slaves connecting via JNLPv4.

They seem to completely ignore the 'Disconnect After Limited Builds' option, which, re-reading the Wiki, seems to be the expected behaviour?

Oren Chapo, your workaround doesn't seem to work for me, at least not when using it within a declarative pipeline.
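
For reference, declarative Pipeline cannot wrap node() in a custom closure, so the imperative parts have to run from a script step. An untested sketch, assuming SafelyDisposeNode is defined above the pipeline block or provided by a shared library:

	// Untested sketch: dispose the agent from post/always so it is shut
	// down even when the build fails.
	pipeline {
		agent { label 'MyNodeLabel' }
		stages {
			stage('Build') {
				steps {
					echo 'pipeline code here'
				}
			}
		}
		post {
			always {
				script { SafelyDisposeNode() }
			}
		}
	}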

Werner Müller added a comment

I modified the workaround to reset the VM in the pipeline itself.

Advantages:

• Shutdown activities are not required in the node configuration.
• The node is reset to the given snapshot before the pipeline body executes.

def ResetNode(String vm, String serverName, String snapshotName, Closure body) {
    node(vm) {
        // Reset the computer in the context of the node, to avoid other
        // jobs being scheduled on this node in the meantime.
        stage('Reset node') {
            def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
            if (slave == null) {
                error "ERROR: Could not get slave object for node!"
            }
            try {
                slave.getComputer().setTemporarilyOffline(true, null)
                // Power-cycle the VM through the vSphere plugin build steps:
                // hard power-off, revert to the clean snapshot, power back on.
                vSphere buildStep: [$class: 'PowerOff', vm: vm, evenIfSuspended: true, shutdownGracefully: false, ignoreIfNotExists: false], serverName: serverName
                vSphere buildStep: [$class: 'RevertToSnapshot', vm: vm, snapshotName: snapshotName], serverName: serverName
                vSphere buildStep: [$class: 'PowerOn', timeoutInSeconds: 240, vm: vm], serverName: serverName
                slave.getComputer().disconnect(null)
                sleep 10 // wait while the agent on the slave starts up
            } catch (err) {
                print "ERROR: could not reset node!"
            } finally {
                slave.getComputer().setTemporarilyOffline(false, null)
            }
            slave = null
        }
    }
    // Re-enter the node: this blocks until the freshly reset agent is
    // online again, then runs the actual pipeline body.
    node(vm) {
        body()
    }
}

ResetNode('vm', 'vCloud', 'clean') {
    // your pipeline code here
}
          



People

Assignee: pjdarton
Reporter: John Mellor
Votes: 4
Watchers: 8