Jenkins / JENKINS-58290

WebSocket / OkHttp thread leak from BourneShellScript + ContainerExecDecorator

    Details

    • Released As:
      durable-task 1.30

      Description

      The durable-task plugin runs a wrapper process which redirects the user process's stdout/err to a file and writes its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; the launch should be fork-and-forget. In fact the Proc is discarded.
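
      Conceptually, the launch amounts to something like this sketch (illustrative file names only, not the plugin's actual control-directory layout):

      # run the user script, capture its output and exit status to files, and
      # detach so the launching JVM need not retain a process handle:
      sh -c '{ sh /path/to/script.sh > jenkins-log.txt 2>&1; echo $? > jenkins-result.txt; } &'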

      Unfortunately, the current implementation in BourneShellScript does not actually allow the Proc to exit until the user process also exits. On a regular agent this does not matter much. But when you run sh steps inside a container on a Kubernetes agent, ContainerExecDecorator and ContainerExecProc actually keep a WebSocket open for the duration of the launched process. This consumes resources on the Kubernetes API server; it is possible to run out of connections. It also consumes three master-side Java threads per sh step, like:

      "OkHttp http://…/..." #361 prio=5 os_prio=0 tid=… nid=… runnable […]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at okio.Okio$2.read(Okio.java:140)
              at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
              at okio.RealBufferedSource.request(RealBufferedSource.java:68)
              at okio.RealBufferedSource.require(RealBufferedSource.java:61)
              at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
              at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
              at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
              at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
              at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
              at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
              at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "OkHttp WebSocket http://…/..." #359 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <…> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
              at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "pool-73-thread-1" #358 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at io.fabric8.kubernetes.client.utils.NonBlockingInputStreamPumper.run(NonBlockingInputStreamPumper.java:57)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
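
      Each running sh step thus holds the equivalent of a long-lived kubectl exec session open against the API server for its whole duration. As an illustrative analogy only (not the literal call the plugin makes), it is as if every step ran:

      kubectl exec my-agent-pod -c ubuntu -- sh -c 'sh /path/to/script.sh'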
      

      To see the problem, you can run

      while (true) {
          podTemplate(label: BUILD_TAG, containers: [containerTemplate(name: 'ubuntu', image: 'ubuntu', command: 'sleep', args: 'infinity')]) {
              node (BUILD_TAG) {
                  container('ubuntu') {
                      branches = [:] // TODO cannot use collectEntries because: java.io.NotSerializableException: groovy.lang.IntRange
                      for (int x = 0; x < 1000; x += 5) {
                          def _x = x
                          branches["sleep$x"] = {
                              sleep time: _x, unit: 'SECONDS'
                              sh """set +x; while :; do echo -n "$_x "; date; sleep 10; done"""
                          }
                      }
                      parallel branches
                  }
              }
          }
      }
      

      and watch via

      while :; do jstack $pid | fgrep '"' | sort | egrep -i 'ok|pool' > /tmp/x; clear; cat /tmp/x; sleep 5; done
      

    Activity

            Daniel Jeznach added a comment - edited

            Hello,
            the fix for this issue causes one of our pipelines to get stuck. Passing the

            -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true
            

            parameter reverts to the old behavior, so I suspect that running the sh build step in a subshell is the cause.
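
            For reference, such a system property is typically passed on the Jenkins controller's JVM command line; assuming a plain java -jar launch (service wrappers and init scripts differ), that would look like:

            java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true -jar jenkins.war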

            The pipeline code is:

            pipeline {
                agent { label "xxxxxx" + (env.JOB_NAME.find(/-(u14|u16|u18|w7|w10)(?=-)/) ?:'') }
                
                options {
                    timeout(time: 60, unit: 'MINUTES')
                }
                    
                environment {
                    TMP = "${env.WORKSPACE}/tmp"
                    PROJECT_NAME = "${env.JOB_NAME}".find(/^[^-]+/)
                    PROJECT_BRANCH = "${env.JOB_NAME}".find(/(?<=^[^-]+-)[^-]+/)
                    PROJECT_RELEASE = "${env.PROJECT_BRANCH}".replace('MAIN', '9999rc')
                    P4_BRANCH = "REL/${env.PROJECT_BRANCH}".replace('REL/MAIN', 'MAIN')
                }
                
                parameters {
                    string(name: 'change', defaultValue: '', description: 'Changelist to build.', trim: true)
                    string(name: 'status', defaultValue: '', description: 'Changelist status (shelved/submitted).', trim: true)
                    string(name: 'review', defaultValue: '', description: 'Helix Swarm review number.', trim: true)
                    string(name: 'pass', defaultValue: '', description: 'Success callback URL.', trim: true)
                    string(name: 'fail', defaultValue: '', description: 'Failure callback URL.', trim: true)
                }
                
                triggers {
                    pollSCM('@midnight')
                }
                
                stages {
                    stage('sync') {
                        steps {
                            p4sync(
                                credential: 'xxxxx',
                                format: 'xxxx.xxxx-${NODE_NAME}-${JOB_NAME}-${EXECUTOR_NUMBER}_ws',
                                populate: forceClean(),
                                source: depotSource("//path/${env.P4_BRANCH}/...")
                            )
                    
                            buildName "${env.P4_CHANGELIST} ${params.status?:''} ${params.change?:''}".trim()
                            sh '''sed 's/\\\$Change: [0-9]\\+ \\\$/\$Change: '$P4_CHANGELIST' \$/' -i xxxxx/version.py'''
                                
                            script {
                                def dot = env.PROJECT_RELEASE.find(/rc$/) ? '' : '.'
                                env.PACKAGE_NAME = "${env.PROJECT_NAME}-${env.PROJECT_RELEASE}${dot}${env.P4_CHANGELIST}.tar.gz"
                            }
                            echo "PROJECT_NAME = $PROJECT_NAME"
                            echo "PROJECT_BRANCH = $PROJECT_BRANCH"
                            echo "PROJECT_RELEASE = $PROJECT_RELEASE"
                            echo "P4_BRANCH = $P4_BRANCH"
                            echo "PACKAGE_NAME = $PACKAGE_NAME"
                        }
                    }
                    stage('build') {
                        steps { 
                            sh '''
                            #!/bin/bash -xe
                    
                            # Clean up possible orphans from other test sessions
                            pkill -P1 -U $(id -u)  || true
                            
                            mkdir $TMP
                            
                            virtualenv venv
                            . venv/bin/activate
                            
                            python setup.py sdist
                            pip install dist/${PROJECT_NAME}-${PROJECT_RELEASE}*.tar.gz
                            xxxxxx master --print-requirements | xargs pip install
                            pip install -r tests/test_requirements.txt
                            '''
                        }
                    }
                    stage('test') { 
                        steps { 
                            sh '''
                                . venv/bin/activate
                                pytest -n 4 \
                                  --timeout=300 \
                                  --junit-xml ${WORKSPACE}/testresults.xml \
                                  --verbose \
                                  --cov=${PROJECT_NAME} \
                                  --cov-branch \
                                  --cov-report xml:${WORKSPACE}/coverage.xml \
                                  --replay-record-dir=testreplay \
                                  tests || true
                            '''
                            junit 'testresults.xml'
                            cobertura coberturaReportFile: 'coverage.xml'
                        }
                    }
                }
            }
            

            It never gets to the "test" stage; it seems that the "sh" step in the "build" stage never ends.
            If you need more information, I'll gladly provide it.

            Jesse Glick added a comment -

            Daniel Jeznach: you can file a separate issue (using the "is caused by" option when linking) with complete, minimal steps to reproduce from scratch.

            Offhand I suspect the pkill command is to blame, perhaps by killing too much. It is possible you did not need this to begin with: exiting a node block already sends a termination signal to all processes which inherited the (IIRC) JENKINS_MAGIC_COOKIE environment variable passed to any nested sh steps.
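
            As a purely hypothetical sketch (the variable name is taken from the recollection above and may differ in practice), a cookie-scoped cleanup would kill only processes that inherited that variable, unlike pkill -P1, which matches any of the user's processes re-parented to init:

            # Hypothetical sketch: kill only processes whose environment carries the
            # cookie inherited from nested sh steps (Linux-only; reads /proc):
            COOKIE=$JENKINS_MAGIC_COOKIE   # assumed variable name, per the recollection above
            for pid in $(pgrep -U "$(id -u)"); do
                if tr '\0' '\n' < "/proc/$pid/environ" 2>/dev/null \
                        | grep -qx "JENKINS_MAGIC_COOKIE=$COOKIE"; then
                    kill "$pid" 2>/dev/null || true
                fi
            done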

            Daniel Jeznach added a comment -

            Jesse Glick: yes, pkill was the culprit in this case. Thank you for the quick and accurate response.


              People

              • Assignee: Jesse Glick
              • Reporter: Jesse Glick
              • Votes: 0
              • Watchers: 3