  1. Jenkins
  2. JENKINS-54540

Pods stuck in Error state are not cleaned up

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: kubernetes-plugin
    • Environment:

      Description

      The majority of my builds run as expected and we run many builds per day. The podTemplate for my builds is:

       

      podTemplate(cloud: 'k8s-houston', label: 'api-build', yaml: """
      apiVersion: v1
      kind: Pod
      metadata:
        name: maven
      spec:
        containers:
        - name: maven
          image: maven:3-jdk-8-alpine
          volumeMounts:
            - name: volume-0
              mountPath: /mvn/.m2nrepo
          command:
          - cat
          tty: true
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
          securityContext:
            runAsUser: 10000
            fsGroup: 10000
      """,
        containers: [
          containerTemplate(name: 'jnlp', image: 'jenkins/jnlp-slave:3.23-1-alpine', args: '${computer.jnlpmac} ${computer.name}', resourceRequestCpu: '250m', resourceRequestMemory: '512Mi'),
          containerTemplate(name: 'pmd', image: 'stash.trinet-devops.com:8443/pmd:pmd-bin-5.5.4', alwaysPullImage: false, ttyEnabled: true, command: 'cat'),
          containerTemplate(name: 'owasp-zap', image: 'stash.trinet-devops.com:8443/owasp-zap:2.7.0', ttyEnabled: true, command: 'cat'),
          containerTemplate(name: 'kubectl', image: 'lachlanevenson/k8s-kubectl:v1.8.7', ttyEnabled: true, command: 'cat'),
          containerTemplate(name: 'dind', image: 'docker:18.01.0-ce-dind', privileged: true, resourceRequestCpu: '20m', resourceRequestMemory: '512Mi',),
          containerTemplate(name: 'docker-cmds', image: 'docker:18.01.0-ce', ttyEnabled: true, command: 'cat', envVars: [envVar(key: 'DOCKER_HOST', value: 'tcp://localhost:2375')]),
        ],
        volumes: [
          persistentVolumeClaim(claimName: 'jenkins-pv-claim', mountPath: '/mvn/.m2nrepo'),
          emptyDirVolume(mountPath: '/var/lib/docker', memory: false)
        ]
      )
      

      However, sometimes a build Pod gets stuck in the Error state in Kubernetes:

       

      ~ # kubectl get pod -o wide
      NAME                                  READY     STATUS    RESTARTS   AGE       IP               NODE
      jenkins-deployment-7849487c9b-nlhln   2/2       Running   4          12d       10.233.92.12     k8s-node-hm-3
      jenkins-slave-7tj0d-ckwbs             11/11     Running   0          31s       10.233.69.176    k8s-node-1
      jenkins-slave-7tj0d-qn3s6             11/11     Running   0          2m        10.233.77.230    k8s-node-hm-2
      jenkins-slave-gz4pw-2dnn5             6/7       Error     0          2d        10.233.123.239   k8s-node-hm-1
      jenkins-slave-m825p-1hjt7             5/5       Running   0          1m        10.233.123.196   k8s-node-hm-1
      jenkins-slave-r59w1-qs283             6/7       Error     0          6d        10.233.76.104    k8s-node-2
      

       

      The listing above shows two Pods in the Error state, one of which has been sitting there for 6 days. I have never seen a Pod in this state recover or get cleaned up. Manual intervention is always necessary.

      When I describe the pod, I see that the "jnlp" container is in a bad state (snippet provided):

       

      ~ # kubectl describe pod jenkins-slave-r59w1-qs283
      Name:         jenkins-slave-r59w1-qs283
      Namespace:    jenkins
      Node:         k8s-node-2/10.0.40.9
      Start Time:   Thu, 01 Nov 2018 12:20:06 +0000
      Labels:       jenkins=slave
                    jenkins/api-build=true
      Annotations:  kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container owasp-zap; cpu limit for container owasp-zap; cpu limit for container dind; cpu limit for container maven; cpu request for ...
      Status:       Running
      IP:           10.233.76.104
      Containers:
        ...
        jnlp:
          Container ID:  docker://a08af23511d01c5f9a249c7f8f8383040a5cc70c25a0680fb0bec4c80439ec7e
          Image:         jenkins/jnlp-slave:3.23-1-alpine
          Image ID:      docker-pullable://jenkins/jnlp-slave@sha256:3cffe807013fece5182124b1e09e742f96b084ae832406a287283a258e79391c
          Port:          <none>
          Host Port:     <none>
          Args:
            b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
            jenkins-slave-r59w1-qs283
          State:          Terminated
            Reason:       Error
            Exit Code:    255
            Started:      Thu, 01 Nov 2018 12:20:12 +0000
            Finished:     Thu, 01 Nov 2018 12:21:01 +0000
          Ready:          False
          Restart Count:  0
          Limits:
            cpu:     2
            memory:  4Gi
          Requests:
            cpu:     250m
            memory:  512Mi
          Environment:
            JENKINS_SECRET:      b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
            JENKINS_TUNNEL:      jenkins-service:50000
            JENKINS_AGENT_NAME:  jenkins-slave-r59w1-qs283
            JENKINS_NAME:        jenkins-slave-r59w1-qs283
            JENKINS_URL:         http://jenkins-service:8080/
            HOME:                /home/jenkins
          Mounts:
            /home/jenkins from workspace-volume (rw)
            /mvn/.m2nrepo from volume-0 (rw)
            /var/lib/docker from volume-1 (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from default-token-kmrnj (ro)
      Conditions:
        Type           Status
        Initialized    True
        Ready          False
        PodScheduled   True
      Volumes:
        volume-0:
          Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
          ClaimName:  jenkins-pv-claim
          ReadOnly:   false
        volume-1:
          Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
        workspace-volume:
          Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
        default-token-kmrnj:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  default-token-kmrnj
          Optional:    false
      QoS Class:       Burstable
      Node-Selectors:  <none>
      Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                       node.kubernetes.io/unreachable:NoExecute for 300s
      Events:          <none>
      

      The jnlp container is in the Terminated state with reason Error and exit code 255.
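Note that in the describe output above the pod-level Status is still Running even though the jnlp container has terminated, so a filter on pod phase alone would not find this pod. As a minimal sketch of a manual check (not something the plugin does), the function below scans the parsed output of `kubectl get pods -o json` for agent pods with a container that terminated with a non-zero exit code. The function name and the sample data are hypothetical; the `jenkins=slave` label and the field values are taken from the output above.

```python
import json


def find_stuck_agent_pods(pod_list, label_key="jenkins", label_value="slave"):
    """Return (pod, container, exit code) tuples for agent pods that have
    at least one container terminated with a non-zero exit code.

    pod_list is the parsed JSON of `kubectl get pods -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        labels = pod.get("metadata", {}).get("labels") or {}
        if labels.get(label_key) != label_value:
            continue  # not a Jenkins agent pod
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("state", {}).get("terminated")
            if terminated and terminated.get("exitCode", 0) != 0:
                stuck.append(
                    (pod["metadata"]["name"], status["name"], terminated["exitCode"])
                )
    return stuck


# Hypothetical sample shaped like the pod described above, trimmed to the
# fields the function reads.
sample = {
    "items": [
        {
            "metadata": {
                "name": "jenkins-slave-r59w1-qs283",
                "labels": {"jenkins": "slave"},
            },
            "status": {
                "phase": "Running",
                "containerStatuses": [
                    {"name": "maven", "state": {"running": {}}},
                    {
                        "name": "jnlp",
                        "state": {"terminated": {"reason": "Error", "exitCode": 255}},
                    },
                ],
            },
        },
        {
            "metadata": {
                "name": "jenkins-slave-7tj0d-ckwbs",
                "labels": {"jenkins": "slave"},
            },
            "status": {
                "phase": "Running",
                "containerStatuses": [{"name": "jnlp", "state": {"running": {}}}],
            },
        },
    ]
}

if __name__ == "__main__":
    print(json.dumps(find_stuck_agent_pods(sample)))
```

In practice the input could come from `kubectl get pods -n jenkins -o json` piped into `json.load(sys.stdin)`; deleting the reported pods remains a separate, manual step.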

       

      When I compare the logs of the failed container (see attached) with those of a healthy container, they look the same up until the failed container shows this message:

       

      Nov 01, 2018 12:20:49 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Nov 01, 2018 12:20:59 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
      INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter@53d577ce

      It then seems to repeat the first attempt before printing a stacktrace, at which point the container enters the state described above.

      I have also attached the Console Output from the build job associated with this pod. The build job spent "7 hr 41 min waiting" and ended up in a failed state.

      It would be nice to fix this so that the Error state is never reached, but the bug I'm pointing out here is that the Pod should be cleaned up when it enters the Error state. Shouldn't the Jenkins kubernetes plugin keep track of this and clean up Pods that end up in this state?

       

       

            Activity

            csanchez Carlos Sanchez added a comment -

            Pods in the Error state are not cleaned up by the plugin by default. Have you tried setting podRetention to never?

            dwatroustrinet Daniel Watrous added a comment -

            podRetention is set to never.

            rthakkar Rishi Thakkar added a comment -

            I see this issue as well when podRetention is set to never.

            shen3lu4 Lu Shen added a comment -

            We see this issue as well when podRetention is set to Never, with Kubernetes plugin version 1.12.3. In our case, we could end up with around 100 pods all in the Error state, and some of those pods were started at the same time.

             

            michael_odell Michael Odell added a comment -

            I see the same thing.  We have podRetention set to never, but out of the probably hundreds of jobs we run per day, a handful of the pods stick around in an error state or in "Running" where only a subset of the containers are running (i.e. READY: 3/4 STATUS: Running).

            FWIW, our jobs don't reuse pods. I suppose if they did we might (?) see this less often. Cleanup is not terribly onerous except that it has to be done every day.

            I have a hard time seeing how this can be classified as an enhancement rather than a bug.  Given long enough, this will cause the system to stop working because of resource exhaustion.

            We happen to be on Kubernetes 1.12.10, plugin version 1.18.3, and Jenkins version 2.176.3, but we have also seen this with slightly older versions of all three, and I believe in testing with newer versions of Jenkins and the plugin (but we're not running enough jobs on those newer versions to see it reliably).

             

            michael_odell Michael Odell added a comment -

            We also saw the problem when podRetention was set to OnError (including the pods that were in Running and not Error) and switched to Never to try to get it to go away.

            alexhraber Alex Raber added a comment - edited

            Bumping this thread: I'm seeing this issue, and it seems to correlate with the jenkins-master container being moved to another Kubernetes node due to resource scaling.

             

            I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.
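To make the suggested timeout threshold concrete, here is a minimal sketch of a reconnect loop with a grace period. It is an illustration only and not how the jnlp agent or Jenkins remoting is implemented; `reconnect_with_grace`, `connect`, `grace_seconds`, and `retry_interval` are hypothetical names.

```python
import time


def reconnect_with_grace(connect, grace_seconds, retry_interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it returns True or grace_seconds have elapsed.

    Returns True if a connection was re-established within the grace
    period, False if the period expired (the caller would then exit).
    """
    deadline = clock() + grace_seconds
    while True:
        if connect():
            return True
        if clock() >= deadline:
            return False
        sleep(retry_interval)


if __name__ == "__main__":
    # Simulated master that becomes reachable on the third attempt.
    attempts = []

    def flaky_connect():
        attempts.append(1)
        return len(attempts) >= 3

    print(reconnect_with_grace(flaky_connect, grace_seconds=5, retry_interval=0.01))
```

The clock and sleep functions are parameters so that the loop can be exercised without real waiting; with such a threshold the agent would only exit once the master had been unreachable for longer than the configured period.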

             

            Apparently, in such an event, error containers are expected and they are essentially zombies that the master will not take care of. The master will however ensure that the jobs resume once master is back online.

             


              People

              • Assignee: Unassigned
              • Reporter: dwatroustrinet Daniel Watrous
              • Votes: 3
              • Watchers: 7
