JENKINS-63441

PodTemplate Container Docker Image Check no longer working

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • kubernetes-plugin
    • None
    • kubernetes 1.27.0

      The feature implemented in [PR-497|https://github.com/jenkinsci/kubernetes-plugin/pull/497] no longer works as intended.

      Since the responsibility for cleaning up terminated pods was moved to a Reaper class ([PR-772|https://github.com/jenkinsci/kubernetes-plugin/pull/772]), the aforementioned feature no longer works. The expected behavior is that when a container uses an invalid Docker image and the pod fails with ImagePullBackOff, a corresponding error message is printed to the calling build's console output and the build is canceled/aborted.
      The error message is still printed, but the build is no longer canceled. As a result, the build loops indefinitely: it requests a worker pod, the pod fails and terminates, and the build then requests a new one.

      The problem occurs because there are no items in the Queue when the Reaper receives the pod failure event. When the Reaper checks the Queue, it cannot locate the corresponding Queue Item, and without the Queue Item it cannot get a reference to the original job to cancel it.

      Before the change, the Queue Item search was handled by the AllContainersRunningPodWatcher.areAllContainersRunning() method, and checking the Queue at that point returned a Queue Item.

      Because the responsibility for cleaning up terminating pods moved from AllContainersRunningPodWatcher to Reaper, the Queue Item responsible for the pod creation has already been removed by the time the Reaper is notified of the event. The Reaper therefore cannot find the corresponding build to cancel, and the result is an infinite loop of requesting new pods that then fail.
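
To make the ordering problem concrete, here is a minimal, self-contained sketch. It does not use the plugin's real classes: `QueueItem`, `BuildQueue`, and `onPodFailed` are hypothetical stand-ins, and matching the item by a label string is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of the timing problem; none of these types are the plugin's real classes. */
final class ReaperTimingSketch {

    /** Hypothetical stand-in for a Queue Item waiting on an agent with a given label. */
    static final class QueueItem {
        final String label;
        boolean cancelled = false;

        QueueItem(String label) {
            this.label = label;
        }
    }

    /** Hypothetical stand-in for the build queue. */
    static final class BuildQueue {
        private final List<QueueItem> items = new ArrayList<>();

        void add(QueueItem item) {
            items.add(item);
        }

        void remove(QueueItem item) {
            items.remove(item);
        }

        Optional<QueueItem> findByLabel(String label) {
            return items.stream().filter(i -> i.label.equals(label)).findFirst();
        }
    }

    /**
     * Models the Reaper reacting to a pod failure event: it can only cancel
     * the build if the matching Queue Item is still in the queue.
     * Returns true when an item was found and cancelled.
     */
    static boolean onPodFailed(BuildQueue queue, String podLabel) {
        Optional<QueueItem> item = queue.findByLabel(podLabel);
        item.ifPresent(i -> i.cancelled = true);
        return item.isPresent();
    }

    public static void main(String[] args) {
        // Ordering 1: the failure event arrives while the item is still queued.
        BuildQueue early = new BuildQueue();
        QueueItem a = new QueueItem("bad-image-pod");
        early.add(a);
        System.out.println("event before removal, cancelled: "
                + onPodFailed(early, "bad-image-pod"));

        // Ordering 2 (the reported behavior): the item is gone before the event arrives.
        BuildQueue late = new BuildQueue();
        QueueItem b = new QueueItem("bad-image-pod");
        late.add(b);
        late.remove(b);
        System.out.println("event after removal, cancelled: "
                + onPodFailed(late, "bad-image-pod"));
    }
}
```

In this model the first ordering cancels the item and the second finds nothing to cancel, which corresponds to the build being left free to request another pod.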

      We're currently trying to submit a fix, but we would appreciate help figuring out one of the following:

      1. A way to keep the Queue Item for creating the job in the Queue long enough for the Reaper to use it.
      2. A way for the Reaper to be made aware of the event before the Queue Item is removed from the Queue.
      3. Whether we need to move the build-canceling functionality out of the Reaper and back into the AllContainersRunningPodWatcher.
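
As a rough illustration of option 3 only, the sketch below shows a check that runs while the watcher is still waiting for containers, i.e. at a point where the Queue Item would still exist. Everything here is hypothetical: `ContainerStatus`, `BuildCanceller`, and `checkAndCancel` are names invented for this example and are not the plugin's API. The only real-world detail used is that Kubernetes reports `ImagePullBackOff` and `ErrImagePull` as container waiting reasons.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of a watcher-side image check; not the plugin's real code. */
final class WatcherCancelSketch {

    /** Waiting reasons that Kubernetes reports for image pull failures. */
    static final Set<String> IMAGE_PULL_FAILURES =
            new HashSet<>(Arrays.asList("ImagePullBackOff", "ErrImagePull"));

    /** Hypothetical container status: name plus waiting reason (null when not waiting). */
    static final class ContainerStatus {
        final String name;
        final String waitingReason;

        ContainerStatus(String name, String waitingReason) {
            this.name = name;
            this.waitingReason = waitingReason;
        }
    }

    /** Hypothetical callback for whatever component cancels the queued build. */
    interface BuildCanceller {
        void cancel(String message);
    }

    /** Returns the first container stuck on an image pull failure, if any. */
    static Optional<ContainerStatus> findImagePullFailure(List<ContainerStatus> statuses) {
        return statuses.stream()
                .filter(s -> s.waitingReason != null
                        && IMAGE_PULL_FAILURES.contains(s.waitingReason))
                .findFirst();
    }

    /**
     * Models one polling pass of the watcher: if any container cannot pull
     * its image, cancel the build immediately instead of waiting for a later
     * pod termination event. Returns true when a cancel was issued.
     */
    static boolean checkAndCancel(List<ContainerStatus> statuses, BuildCanceller canceller) {
        Optional<ContainerStatus> failed = findImagePullFailure(statuses);
        failed.ifPresent(s ->
                canceller.cancel("Unable to pull Docker image for container " + s.name));
        return failed.isPresent();
    }
}
```

The point of the sketch is the ordering rather than the detection logic: the cancel is issued from the same code path that is waiting on the pod, so it does not depend on a later Queue lookup.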

       

      Steps to Recreate Issue:

      1. Create a Jenkinsfile pipeline with a Kubernetes agent that specifies a container using a nonexistent Docker image.
      2. Build the job.
      3. Observe that the build loops indefinitely, repeatedly requesting new pods that fail.

            Assignee: Unassigned
            Reporter: Pierson Yieh (pyieh)