Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-53926

Builds hang ocassionally on resume after Jenkins restart

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Minor Minor
    • kubernetes-plugin
    • None
    • Jenkins 2.138.1, Kubernetes plugin 1.12.6, Kubernetes 1.11

      Occasionally, we observe an issue with resuming jobs after a Jenkins Master restart.

      Our configuration. Jenkins 2.138.1 (some older versions had the same problem) with the init.groovy.d configuration + persistent jobs running on top of the Kubernetes cluster. Slaves/executors are setup with the Kubernetes plugin 1.12.6 on Kubernetes 1.11.

      After the Jenkins master restart a simple job (a git clone + a shell script) sometimes resume properly:

      15:19:18 Commit message: "Quick commit"
      15:19:18  > git rev-list --no-walk aaaaaa9bd4a093e0364df0f52f5447c15f3785f0 # timeout=10
      [Pipeline] }
      [Pipeline] // dir
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] stage
      [Pipeline] { (Execute my fancy shell script)
      [Pipeline] sh
      Resuming build at Wed Oct 01 13:19:59 GMT 2018 after Jenkins restart
      Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      15:19:19 [xxxxx-xxxxx] Running shell script
      15:19:22 + vault login '-method=aws' 'role=jenkins'
      Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #40
      Ready to run at Wed Oct 01 13:20:13 GMT 2018
      15:20:13 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
      15:20:13 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx): 
      15:20:14 (script execution performed properly)
      (...)
      Finished: SUCCESS
      

      However, usually it hangs at:

      15:31:50 Running in /home/jenkins/workspace/xxxxx-xxxxx/playbooks
      [Pipeline] {
      [Pipeline] git
      [Pipeline] }
      [Pipeline] // dir
      Resuming build at Wed Oct 01 13:32:39 GMT 2018 after Jenkins restart
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
      Ready to run at Wed Oct 01 13:33:07 GMT 2018
      15:33:07 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
      15:33:07 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx): 
      15:33:07 
      

      and it needs to be aborted.

      From the Jenkins logs of the failed resume I see:

      Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Agent has already been launched, activating: {}
      Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Agent has already been launched, activating: {}
      Oct 01, 2018 1:33:03 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
       Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxx-xxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskAccepted
      Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerStepExecution
      onResume
      Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
      Building connection to Kubernetes kubernetes URL  namespace default
      Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
      Autoconfiguring Kubernetes client
      Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
      Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress=, namespace=default, caCertData=null, credentials=null, skipTlsVerify=true, connectTimeout=0, readTimeout=0]
      Oct 01, 2018 1:33:08 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
      Connected to Kubernetes kubernetes URL 
      

      the next lines in the successful resume (there are not present on failure) are:

      Oct 01, 2018 1:20:15 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      Launch proc with environment: [AGENT_WORKDIR=/home/jenkins/agent, ...]
      Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      Executing shell script inside container [myContainer] of pod [jenkins-slave-ansible-update-adfs-mgmt-lfm45-02g27]
      Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      onOpen : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
      Oct 01, 2018 1:20:17 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      Launching with env vars: [...]
      Oct 01, 2018 1:20:19 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      Executing command: "... my shell command ..." 
      Oct 01, 2018 1:20:19 PM INFO org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1 doLaunch
      Created process inside pod: [jenkins-slave-ansible-update-xxxxx-xxxxx], container: [myContainer] with pid:[-1]
      Oct 01, 2018 1:20:44 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
      onClose : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
      Oct 01, 2018 1:20:45 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
       Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxxx-xxxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskCompleted
      ...
      

      Looking at the plugin source code didn't help. We don't know the job hangs and Jenkins is not able to carry on with the executor.

      Do you have any suggestion how it could be fixed or at least investigated deeply?

            Unassigned Unassigned
            emszpak Marcin Zajączkowski
            Votes:
            4 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated: