JENKINS-39835

Be super defensive in remoting read

      Description

      Not yet observed, but an OutOfMemoryError could kill an agent connection: the read ops would not be reset if a Throwable that is not a RuntimeException (i.e. any subclass of Error) were thrown.

      The code should defend against this and terminate the connection so that it can be re-established, rather than leaving it hung.
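The defensive behaviour described above could be sketched roughly as follows. This is not the actual NIONetworkLayer code: `Connection`, `ReadHandler`, and `handleRead` are hypothetical names invented for illustration, and the "read armed" flag is only a stand-in for the real read ops. The one point it tries to show is catching `Throwable` rather than `RuntimeException`, so that an `Error` also closes the connection instead of leaving it paused forever.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: Connection, ReadHandler and handleRead are illustrative
// names, not the real NIONetworkLayer API.
class DefensiveReadSketch {

    /** Stand-in for an agent connection with a "read armed" flag. */
    static final class Connection {
        final AtomicBoolean readArmed = new AtomicBoolean(true);
        final AtomicBoolean closed = new AtomicBoolean(false);

        void close() {
            readArmed.set(false);
            closed.set(true);
        }
    }

    /** Stand-in for whatever processes the bytes that were read. */
    interface ReadHandler {
        void onRead() throws Exception;
    }

    static void handleRead(Connection connection, ReadHandler handler) {
        connection.readArmed.set(false); // reads are paused while one is processed
        try {
            handler.onRead();
            connection.readArmed.set(true); // success: re-arm for the next read
        } catch (Throwable t) {
            // Throwable covers Error (e.g. OutOfMemoryError) and checked
            // exceptions, not only RuntimeException.
            connection.close();
        }
    }

    public static void main(String[] args) {
        Connection c = new Connection();
        handleRead(c, () -> { throw new OutOfMemoryError("simulated"); });
        System.out.println("closed=" + c.closed.get() + " readArmed=" + c.readArmed.get());
    }
}
```

With a `catch (RuntimeException e)` in place of `catch (Throwable t)`, the simulated `OutOfMemoryError` would propagate out with the connection neither re-armed nor closed, which is the hung case the description warns about.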

            Activity

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: James Nord
            Path:
            src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
            http://jenkins-ci.org/commit/remoting/ec9b5c13b879f44c04fa28ee6c8b113a165c9e57
            Log:
            Be extra defensive about Errors and Exceptions

            JENKINS-39835 Be even more defensive then against leaving connections dangling.

            teilo James Nord added a comment -

            I believe this issue has now been observed on a live site.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Oleg Nenashev
            Path:
            src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
            http://jenkins-ci.org/commit/remoting/32674f6221cb93c7b5217231afc1b5fbec554d77
            Log:
            Merge pull request #133 from jenkinsci/jtnord-patch-1

            JENKINS-39835 - Be extra defensive about Errors and Exceptions

            Compare: https://github.com/jenkinsci/remoting/compare/b50beca9e888...32674f6221cb

            mmitche Matthew Mitchell added a comment -

            Alright, so I root-caused most of this. While there certainly are issues around the error handling, the errors we saw were all caused by memory. As we begin to run out of memory, the finally blocks that should zero out the channel object never get called. This causes a sort of cascading failure that manifests in a number of ways, including the error message above: the number of threads jumps, reflection starts to hang (Job DSL starts to fail), etc.

            In my instance, the root cause was the workspace cleanup plugin combined with node recycling. This kept channel objects around forever in some cases, causing a slow leak.

            I would first verify that memory isn't the cause of the failure. I do the following:

            Watch number of threads:

            watch -n1 'find /proc/<jenkins pid>/task -mindepth 1 -maxdepth 1 -type d -print | wc -l'

            Watch gc stats:

            jstat -gccause -t -h25 <pid> 10s

            If the number of threads starts to jump into the high thousands (depending on your heap setup), that is a good indication that memory is the cause.
            jstat will eventually show failures to allocate and a high number of full GCs.
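The two commands above watch the process from the outside. As a rough in-process counterpart, the same signals (live thread count, cumulative GC count, heap usage) can be read through the standard `java.lang.management` beans. This is only a sketch under assumptions: the class name `JvmPressureSketch` and its method names are made up for illustration, and it reports on whichever JVM it runs in, so it is useful only if executed inside the JVM under suspicion.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper; the class and method names are illustrative only.
class JvmPressureSketch {

    /** Number of live threads in this JVM. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Sum of collection counts across all collectors in this JVM. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** One-line summary suitable for periodic logging. */
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("threads=%d gcCount=%d heapUsedBytes=%d heapMaxBytes=%d",
                liveThreads(), totalGcCount(), heap.getUsed(), heap.getMax());
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples one second apart, similar in spirit to `watch -n1`.
        for (int i = 0; i < 3; i++) {
            System.out.println(snapshot());
            Thread.sleep(1000);
        }
    }
}
```

A thread count that keeps climbing between samples while heap usage stays close to the maximum would point in the same direction as the `watch` and `jstat` output described above.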

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Oleg Nenashev
            Path:
            pom.xml
            http://jenkins-ci.org/commit/jenkins/7c2e1b2ece1770874eedd69cf20142aad4b491b9
            Log:
            [FIXED JENKINS-39835] - Update remoting to 3.4 (#2679)


              People

              • Assignee:
                teilo James Nord
                Reporter:
                teilo James Nord