  1. Jenkins
  2. JENKINS-55527

Builds fail randomly when running sh in container

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Defect
    • Component/s: kubernetes-plugin
    • Labels:
    • Environment:
      Running jenkins in a Kubernetes cluster on GCP

      Description

      My devs are complaining that builds fail randomly when a stage starts. The builds fail when attempting to run "sh" in a container in the pods running the job.
      Here is the error message I see:

      [Pipeline] sh
      rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:87: adding pid 3786794 to cgroups caused \"failed to write 3786794 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/pod70971cd7-153a-11e9-9fe5-42010a567404/6b66fd31d9718f168c34810477e328045af5caead06e9e7f48ed3b9431eb3d37/cgroup.procs: invalid argument\""
      [Pipeline] echo
      Error: java.io.IOException: Pipe closed...
      ...
      ...
      ERROR: script returned exit code 1
      Finished: FAILURE
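
      For triage, the pid, pod UID, and container ID can be pulled out of the error text to see which pods are affected. Below is a minimal sketch in Python, assuming the cgroup path layout shown in the message above (controller / kubepods / QoS class / pod UID / container ID). `parse_cgroup_error` is a hypothetical helper for illustration, not part of Jenkins or the kubernetes-plugin.

```python
import re

# Pattern for the cgroup write failure quoted in this issue. The path layout
# (controller / kubepods / QoS class / pod UID / container ID) is an assumption
# taken from the error text; other runtimes or cgroup drivers may differ.
CGROUP_ERROR = re.compile(
    r"failed to write (?P<pid>\d+) to cgroup\.procs: "
    r"write /sys/fs/cgroup/(?P<controller>[^/]+)/kubepods/"
    r"(?P<qos>[^/]+)/pod(?P<pod_uid>[0-9a-f-]+)/"
    r"(?P<container_id>[0-9a-f]+)/cgroup\.procs: invalid argument"
)


def parse_cgroup_error(line):
    """Return a dict with pid, controller, qos, pod_uid and container_id, or None."""
    match = CGROUP_ERROR.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Excerpt of the error line from the description above.
    sample = (
        "failed to write 3786794 to cgroup.procs: "
        "write /sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/"
        "pod70971cd7-153a-11e9-9fe5-42010a567404/"
        "6b66fd31d9718f168c34810477e328045af5caead06e9e7f48ed3b9431eb3d37/"
        "cgroup.procs: invalid argument"
    )
    print(parse_cgroup_error(sample))
```

      Running this over a console log line by line and grouping the results by `pod_uid` would show whether the failures cluster on particular pods; lines without the cgroup error return `None`.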


          Activity

          apowell Andy Powell added a comment -

          We are seeing similar errors.
          rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:87: adding pid 3559144 to cgroups caused \"failed to write 3559144 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/pod03460050-19f2-11e9-beb6-42010a8e01b2/f4978b3f515bcf1f942bd0ed21ce084ca2039e6ae56f357870cbbe55517ed151/cgroup.procs: invalid argument\""

          command terminated with non-zero exit code: Error executing in Docker Container: 126
          process apparently never started in /home/jenkins/workspace/le_platform-nodejs-hello_sandbox@tmp/durable-11cdeba1
          Jenkins version = 2.150.1, running on GKE

           

          csanchez Carlos Sanchez added a comment -

          This looks like https://github.com/moby/moby/issues/31230, and the fix could be in runc v1.0.0-rc6: https://github.com/opencontainers/runc/pull/1916
          akamel1001 Ahmed Kamel added a comment -

          Got it. Thanks for posting these. I'll watch the issue over on github.

          apowell Andy Powell added a comment -

          Update: We were able to isolate this to a security scanner within our GKE cluster.  Turning it off made the problems go away.  

          akamel1001 Ahmed Kamel added a comment -

          Andy Powell are you talking about a GCP service or was it a 3rd party security scanner that was causing the issue?

          apowell Andy Powell added a comment -

          Ahmed Kamel it was not GCP, but the 3rd party product, that was causing the issue. Jenkins has been running fine on GKE since we turned off the 3rd party service.

          akamel1001 Ahmed Kamel added a comment -

          Got it. We have a very similar setup here.

          Feel free not to comment but was this 3rd party tool Twistlock by any chance? 

          apowell Andy Powell added a comment -

          Yes, it was.

          csanchez Carlos Sanchez added a comment -

          Thanks for figuring it out

          akamel1001 Ahmed Kamel added a comment -

          Awesome, thank you Andy Powell for tracking this down. We disabled it and saw the error count drop significantly.

          akamel1001 Ahmed Kamel added a comment -

          For whoever stumbles onto this thread:

          Our security team reached out to Twistlock to try to figure out the root cause of this issue. Twistlock told us they are aware of the issue and are working on updates. In the meantime, here is a nice blog post that explains the issue and how it was found:

          https://www.twistlock.com/2018/12/04/advanced-runc-debugging-fun-profit/


            People

            • Assignee:
              csanchez Carlos Sanchez
              Reporter:
              akamel1001 Ahmed Kamel
            • Votes:
              1
              Watchers:
              3
