JENKINS-59340: Pipeline hangs when Agent pod is Terminated

    Details

      Description

      When an agent pod gets terminated (for example, OOMKilled by Kubernetes) while a pipeline build is running a shell step:

      • the node remains in Jenkins, disconnected
      • the pipeline hangs forever
      • the pod remains in Kubernetes, in Terminated state, with OOMKilled status

      Manual intervention is necessary to fix this situation:

      • Aborting the pipeline manually causes the node to be removed and, eventually, the pod to be deleted as well
      • Deleting the pod manually causes the node to be removed (after about 5 minutes, for reasons unclear) and, eventually, the pipeline to be aborted (see the kubectl sketch after this list)
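
      For reference, the pod-deletion path from the command line; a minimal sketch, using the pod name from the reproduction below:

      # Confirm the pod is stuck in Terminated state with OOMKilled status
      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
      # Delete it; the node disappears about 5 minutes later and the build is eventually aborted
      $ kubectl delete pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj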

      Expected Behavior

      The pipeline should abort automatically and the node should be removed automatically.
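
      Until that is the case, one possible mitigation (a sketch, not a fix for the underlying bug) is to bound the shell step with the Pipeline timeout step, so the build can be aborted after a known bound instead of hanging forever:

      steps {
        // Mitigation sketch: cap the step so a lost agent cannot hang the build indefinitely
        timeout(time: 5, unit: 'MINUTES') {
          sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
        }
      }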

      How to Reproduce

      We need to simulate a pod failure while the agent is connected and building a pipeline. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

      • Create a pipeline that simulates a Kubernetes `OOMKilled` during the build (the 128Mi memory limit is far below the 1G that stress-ng allocates, so the container gets OOM-killed shortly after the step starts):
      pipeline {
        agent {
          kubernetes {
            yaml """
      metadata:
        labels:
          cloudbees.com/master: "dse-team-apac"
          jenkins: "slave"
          jenkins/stress: "true"
      spec:
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "0.2"
            requests:
              memory: "100Mi"
              cpu: "0.2"
          securityContext:
            privileged: true
          tty: true
      """
          }
        }
        stages {
          stage('stress') {
            steps {
              sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
            }
          }
        }
      }
      

      The pod should get OOMKilled by Kubernetes:

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
      NAME                                                          READY   STATUS      RESTARTS   AGE
      dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
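
      To confirm the termination reason Kubernetes recorded, the container status can be queried directly (a sketch; expected output shown):

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj \
          -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
      OOMKilled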
      

      And the pipeline job shows the disconnection and hangs forever:

      Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stress)
      [Pipeline] sh
      + stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
      stress-ng: debug: [86] 2 processors online, 2 processors configured
      stress-ng: info:  [86] dispatching hogs: 2 vm
      stress-ng: debug: [86] cache allocate: default cache size: 46080K
      stress-ng: debug: [86] starting stressors
      stress-ng: debug: [86] 2 stressors spawned
      stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
      stress-ng: debug: [89] stress-ng-vm using method 'all'
      stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
      stress-ng: debug: [88] stress-ng-vm using method 'all'
      Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
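
      At this point the disconnected node can also be removed from the Jenkins script console; a minimal sketch of my own cleanup snippet, assuming the node name matches the pod name:

      import jenkins.model.Jenkins

      // Remove the disconnected agent left behind by the terminated pod
      def node = Jenkins.get().getNode('dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj')
      if (node != null) {
          Jenkins.get().removeNode(node)
      }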
      

        Attachments

        • kubernetes-plugin-fine.log
        • durabletask-and-workflowdurabletask-fine.log
        • build.log
        • agent-oom-killed-description.txt

          Issue Links

          • This issue relates to JENKINS-49707

              People

              • Assignee: Unassigned
              • Reporter: Allan BURDAJEWICZ (allan_burdajewicz)
              • Votes: 1
              • Watchers: 3