Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-55597

Jenkins JNLP agent disconnects when it fails to delete an old build workspace

    Details

    • Similar Issues:

      Description

      The agent stays up, but the status in the dashboard keeps changing from online to offline and back. The server-side log at https://X/computer/Y/log shows these messages in a loop:

      JNLP agent connected from Y/Z
      Remoting version: 3.28
      This is a Unix agent
      
      deleting obsolete workspace /home/infinispan/workspace/Openshift_Services_PR-141
      
      java.nio.file.FileSystemException: /home/infinispan/workspace/Openshift_Services_PR-141/openshift.local.clusterup/openshift.local.volumes/pods: Operation not permitted
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixFileAttributeViews$Posix.setMode(UnixFileAttributeViews.java:238)
      	at sun.nio.fs.UnixFileAttributeViews$Posix.setPermissions(UnixFileAttributeViews.java:260)
      	at java.nio.file.Files.setPosixFilePermissions(Files.java:2045)
      	at hudson.Util.makeWritable(Util.java:337)
      	at hudson.Util.tryOnceDeleteFile(Util.java:297)
      	at hudson.Util.deleteFile(Util.java:272)
      Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from nat-pool-rdu-t.redhat.com/66.187.233.202:45016
      		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
      		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
      		at hudson.remoting.Channel.call(Channel.java:957)
      		at hudson.FilePath.act(FilePath.java:1072)
      		at hudson.FilePath.act(FilePath.java:1061)
      		at hudson.FilePath.deleteRecursive(FilePath.java:1268)
      		at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:534)
      		at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:692)
      		at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:168)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState$4.invoke(JnlpConnectionState.java:421)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:312)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:418)
      		at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler$1.run(JnlpProtocol4Handler.java:334)
      		at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      		at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      Caused: java.io.IOException: Unable to delete '/home/infinispan/workspace/Openshift_Services_PR-141/openshift.local.clusterup/openshift.local.volumes/pods'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
      	at hudson.Util.deleteFile(Util.java:277)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1305)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.access$1600(FilePath.java:213)
      	at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1274)
      	at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1270)
      	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3086)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:212)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:369)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
      	at java.lang.Thread.run(Thread.java:748)
      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
      	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
      	at hudson.remoting.Channel.close(Channel.java:1452)
      	at hudson.remoting.Channel.close(Channel.java:1405)
      	at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:173)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState$4.invoke(JnlpConnectionState.java:421)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:312)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:418)
      	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler$1.run(JnlpProtocol4Handler.java:334)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      The agent log doesn't include any of this, even after I configured j.u.l with `.level = ALL`.

      IMO failing to delete a build directory is not a fatal error, it should only affect builds that try to use exactly the same directory.

       

      Additionally, while the node is online, it may start to run a build, but then it immediately goes offline and the build fails with this error:

      Error when executing always post condition:
      org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
      Perhaps you forgot to surround the code with a step that provides this, such as: node
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:261)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:260)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
      	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:155)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
      	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
      	at WorkflowScript.run(WorkflowScript:79)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.delegateAndExecute(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:133)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:789)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.catchRequiredContextForNode(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:418)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.catchRequiredContextForNode(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:416)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:788)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2030)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2015)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2056)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:783)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.executePostBuild(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:760)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
      	at sun.reflect.GeneratedMethodAccessor327.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$101(SandboxContinuable.java:34)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.lambda$run0$0(SandboxContinuable.java:59)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:121)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:58)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:182)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
      

      It would be great if the error message would say explicitly that the build is failing because the node was disconnected, and if the agent-side logs included the error as well.

        Attachments

          Issue Links

            Activity

            danberindei Dan Berindei created issue -
            Hide
            rynlopez Ryan Lopez added a comment -

            We are running into the same problem described above. Jenkins tries to delete files which are created by a Docker script as owned by root with no regular user write permissions. Jenkins tries to clean up the obsolete workspaces and immediately disconnects and reconnects. If anyone knows of a workaround or fix, we are very interested. Thanks!

            Show
            rynlopez Ryan Lopez added a comment - We are running into the same problem described above. Jenkins tries to delete files which are created by a Docker script as owned by root with no regular user write permissions. Jenkins tries to clean up the obsolete workspaces and immediately disconnects and reconnects. If anyone knows of a workaround or fix, we are very interested. Thanks!
            Hide
            xitrium Sean Talts added a comment -

            Same. My workaround so far has just been to log in to the machines and manually rm -rf the entire workspace. Unfortunately I have to do this every time Jenkins is restarted to e.g. install updates.

            Show
            xitrium Sean Talts added a comment - Same. My workaround so far has just been to log in to the machines and manually rm -rf the entire workspace. Unfortunately I have to do this every time Jenkins is restarted to e.g. install updates.
            bitwiseman Liam Newman made changes -
            Field Original Value New Value
            Link This issue is caused by JENKINS-57111 [ JENKINS-57111 ]
            bitwiseman Liam Newman made changes -
            Status Open [ 1 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            Hide
            bitwiseman Liam Newman added a comment -

            I believe this is fixed in Jenkins v2.177.
            Please comment if this should be reopened.

            Show
            bitwiseman Liam Newman added a comment - I believe this is fixed in Jenkins v2.177. Please comment if this should be reopened.
            Hide
            jameshowe James Howe added a comment -

            This also appears to affect SSH agents in 2.164.3, although the exception is a channel EOF instead of the FileSystemException.
            Deleting the directory manually allows the agent to startup correctly.

            Can this fix be ported to LTS please?

            Show
            jameshowe James Howe added a comment - This also appears to affect SSH agents in 2.164.3, although the exception is a channel EOF instead of the FileSystemException. Deleting the directory manually allows the agent to startup correctly. Can this fix be ported to LTS please?
            Hide
            bitwiseman Liam Newman added a comment - - edited

            James Howe
            It is my understanding the LTS backporting is reserved for security issues and we missed the window for 2.164.3.
            The next LTS will be out in June, but the base looks like it will be 2.176 - missed by one. I will see what I can do to get this pulled into that release.

            In the meanwhile, do you have test instance you can try on 2.177 and verify this is fixed? That would be very helpful.

            Show
            bitwiseman Liam Newman added a comment - - edited James Howe It is my understanding the LTS backporting is reserved for security issues and we missed the window for 2.164.3. The next LTS will be out in June, but the base looks like it will be 2.176 - missed by one. I will see what I can do to get this pulled into that release. In the meanwhile, do you have test instance you can try on 2.177 and verify this is fixed? That would be very helpful.
            Hide
            jameshowe James Howe added a comment - - edited

            To justify a security patch, any user who can submit a job that uses Docker can probably use this to make a DoS attack.
            That's how I'm getting it - the docker container runs as root and can write to the workspace, resulting in files the jenkins user cannot delete.

            Show
            jameshowe James Howe added a comment - - edited To justify a security patch, any user who can submit a job that uses Docker can probably use this to make a DoS attack. That's how I'm getting it - the docker container runs as root and can write to the workspace, resulting in files the jenkins user cannot delete.
            Hide
            bitwiseman Liam Newman added a comment -

            Okay, I was wrong about security only, it turns out to be just adding a label, which I've done for JENKINS-57111. We'll see how that goes.

            Show
            bitwiseman Liam Newman added a comment - Okay, I was wrong about security only, it turns out to be just adding a label, which I've done for JENKINS-57111 . We'll see how that goes.
            jameshowe James Howe made changes -
            Component/s ssh-agent-plugin [ 17509 ]
            Component/s ws-cleanup-plugin [ 15930 ]
            Component/s _unsorted [ 19622 ]
            jameshowe James Howe made changes -
            Component/s remoting [ 15489 ]
            Component/s ssh-agent-plugin [ 17509 ]
            Hide
            bitwiseman Liam Newman added a comment -

            James Howe
            This will be in LTS 2.176.1.

            Show
            bitwiseman Liam Newman added a comment - James Howe This will be in LTS 2.176.1.

              People

              • Assignee:
                Unassigned
                Reporter:
                danberindei Dan Berindei
              • Votes:
                4 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: