Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-38381

[JEP-210] Optimize log handling in Pipeline and Durable Task

    Details

    • Similar Issues:
    • Sprint:
      Pipeline - October, Pipeline - April 2018

      Description

      Extracted from JENKINS-30896:

      The current design of LogActionImpl, using $id.log, was considered the minimum necessary for a working 1.0 release, not a serious implementation. It has a major problem: when there is a large amount of output, WorkflowRun.copyLogs must duplicate it all to log, doubling disk space requirements per build.

      It would be better to keep a single log file for the build. LogActionImpl should deprecated in favor of an implementation that simply stores a rangeset of offsets into that file. When parallel blocks are producing concurrent output, the single log file will be a bit jumbled (probably still human-readable in most cases), but the rangesets will keep track of what output came from where. The final output produced by WorkflowRun will still be processed to split at line boundaries, add in thread labels, etc. (TBD how and whether JENKINS-30777 could be supported in this mode.)

        Attachments

          Issue Links

            Activity

            Hide
            jglick Jesse Glick added a comment -

            Expanding scope a bit since Oleg Nenashev’s work requires org.jenkinsci.plugins.durabletask.Controller.writeLog to be reconsidered as well: we do not want to force step logs to be streamed back to the master. So need an extension point (which for ease of development could start off inside a plugin) which would allow a Run to obtain a BuildListener whose serial form would allow this alternate mode.

            Show
            jglick Jesse Glick added a comment - Expanding scope a bit since Oleg Nenashev ’s work requires org.jenkinsci.plugins.durabletask.Controller.writeLog to be reconsidered as well: we do not want to force step logs to be streamed back to the master. So need an extension point (which for ease of development could start off inside a plugin) which would allow a Run to obtain a BuildListener whose serial form would allow this alternate mode.
            Hide
            jglick Jesse Glick added a comment -

            Can roll in a fix of JENKINS-31096 by just using UTF-8 everywhere.

            Show
            jglick Jesse Glick added a comment - Can roll in a fix of JENKINS-31096 by just using UTF-8 everywhere.
            Hide
            jglick Jesse Glick added a comment -

            Have an early draft available. Not yet trying to avoid streaming DurableTaskStep output back to the master.

            Show
            jglick Jesse Glick added a comment - Have an early draft available. Not yet trying to avoid streaming DurableTaskStep output back to the master.
            Hide
            jglick Jesse Glick added a comment -

            Basic implementation of a unified log is working. Now I am looking into having an externalizable log storage, and specifically allowing DurableTaskStep.Execution to stream log text directly there from an agent. But when the master is not involved in collecting log output from a sh/bat step, then it is pointless to have it polling the agent over Remoting every 15s or less just to get an exit status and save a log location—this just adds traffic to the channel, and load to the master. So now I am investigating whether the controller can autonomously stream output and send a callback with exit status. This is essentially a rewrite a durable task handling. The goal is that if you run, say,

            node {
              sh '''#!/bin/bash
            for x in {0000..9999}
            do
              echo $x
              sleep 1
            done
            '''
            }
            

            then for the next three hours there should be zero activity on the remoting channel (other than ping threads) and zero load on master related to this build (assuming no human is watching the log).

            Show
            jglick Jesse Glick added a comment - Basic implementation of a unified log is working. Now I am looking into having an externalizable log storage, and specifically allowing DurableTaskStep.Execution to stream log text directly there from an agent. But when the master is not involved in collecting log output from a sh / bat step, then it is pointless to have it polling the agent over Remoting every 15s or less just to get an exit status and save a log location—this just adds traffic to the channel, and load to the master. So now I am investigating whether the controller can autonomously stream output and send a callback with exit status. This is essentially a rewrite a durable task handling. The goal is that if you run, say, node { sh '''#!/bin/bash for x in {0000..9999} do echo $x sleep 1 done ''' } then for the next three hours there should be zero activity on the remoting channel (other than ping threads) and zero load on master related to this build (assuming no human is watching the log).
            Hide
            jglick Jesse Glick added a comment -

            Ran into an obstacle: BourneShellScript as currently implemented relies on ProcessLiveness to tell if a process has exited improperly, so sh can fail with a -1 exit code. Now if this is using a LocalLauncher or a RemoteLauncher, the normal case, it will directly call getpgid in libc. But if it is dealing with a decorated launcher, it delegates to that in order to tell if the process is alive. When you are using docker.image(…).inside, that is what allows the agent to determine whether a process inside the container is alive; getpgid will not work, since Docker remaps PIDs. The problem is that this hook is done by passing a special Launcher as a contextual object in the Pipeline step’s block, but this is not Serializable and not safe for transfer to the agent. So without new APIs, that docker-workflow would need to implement, there seems to be no way for an autonomous agent to check process liveness when using .inside without asking for help from the master, which would defeat some of the performance benefits here when using Docker.

            Possibly need to rework this system so that process liveness checks are not done constantly when checking exit status, but rather periodically at the initiation of the master (say every hour) while there is no reported exit status. That might also help work around lost notifications etc.

            Show
            jglick Jesse Glick added a comment - Ran into an obstacle: BourneShellScript as currently implemented relies on ProcessLiveness to tell if a process has exited improperly, so sh can fail with a -1 exit code. Now if this is using a LocalLauncher or a RemoteLauncher , the normal case, it will directly call getpgid in libc . But if it is dealing with a decorated launcher, it delegates to that in order to tell if the process is alive. When you are using docker.image(…).inside , that is what allows the agent to determine whether a process inside the container is alive; getpgid will not work, since Docker remaps PIDs. The problem is that this hook is done by passing a special Launcher as a contextual object in the Pipeline step’s block, but this is not Serializable and not safe for transfer to the agent. So without new APIs, that docker-workflow would need to implement, there seems to be no way for an autonomous agent to check process liveness when using .inside without asking for help from the master, which would defeat some of the performance benefits here when using Docker. Possibly need to rework this system so that process liveness checks are not done constantly when checking exit status, but rather periodically at the initiation of the master (say every hour) while there is no reported exit status. That might also help work around lost notifications etc.
            Hide
            jglick Jesse Glick added a comment -

            I think I have a design that will work, but still fleshing out a prototype.

            Show
            jglick Jesse Glick added a comment - I think I have a design that will work, but still fleshing out a prototype.
            Hide
            jglick Jesse Glick added a comment -

            Basic stuff now seems to be working.

            Show
            jglick Jesse Glick added a comment - Basic stuff now seems to be working.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            pom.xml
            src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
            src/main/java/org/jenkinsci/plugins/durabletask/Controller.java
            src/main/java/org/jenkinsci/plugins/durabletask/DurableTask.java
            src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java
            src/main/java/org/jenkinsci/plugins/durabletask/Handler.java
            src/main/java/org/jenkinsci/plugins/durabletask/executors/OnceRetentionStrategy.java
            src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
            http://jenkins-ci.org/commit/durable-task-plugin/eaa6b02e8a2aec369f727c29a15a7fc44ebcd12a
            Log:
            JENKINS-38381 Prototype API to receive asynchronous notifications of process output or exit code.

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: pom.xml src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java src/main/java/org/jenkinsci/plugins/durabletask/Controller.java src/main/java/org/jenkinsci/plugins/durabletask/DurableTask.java src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java src/main/java/org/jenkinsci/plugins/durabletask/Handler.java src/main/java/org/jenkinsci/plugins/durabletask/executors/OnceRetentionStrategy.java src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java http://jenkins-ci.org/commit/durable-task-plugin/eaa6b02e8a2aec369f727c29a15a7fc44ebcd12a Log: JENKINS-38381 Prototype API to receive asynchronous notifications of process output or exit code.
            Hide
            jglick Jesse Glick added a comment -

            Integrated demo is in the logstash PR.

            Show
            jglick Jesse Glick added a comment - Integrated demo is in the logstash PR.
            Hide
            recampbell Ryan Campbell added a comment - - edited

            Benefits of this change: This change will fix unicode support. This change will also send all log information to a single sink instead of seperate logfiles per step, removing a bottleneck in the build logging. Also, instead of the master polling for output from the pipeline, the logs are instead pushed from the slave to the master.

            We agree that we should implement this change, but leave the extension point `PipelineLogFile` marked as beta.

            Our plan is for Jesse to complete testing of this change and integrate it into the master branch when ready.

            Show
            recampbell Ryan Campbell added a comment - - edited Benefits of this change: This change will fix unicode support. This change will also send all log information to a single sink instead of seperate logfiles per step, removing a bottleneck in the build logging. Also, instead of the master polling for output from the pipeline, the logs are instead pushed from the slave to the master. We agree that we should implement this change, but leave the extension point `PipelineLogFile` marked as beta. Our plan is for Jesse to complete testing of this change and integrate it into the master branch when ready.
            Hide
            jglick Jesse Glick added a comment -

            Baseline (pre-patch) performance for a build with one (echo) step producing a million lines of output on my laptop:

            • Took 1791ms to run the build
            • Took 437ms to write HTML of whole build
            • Took 0ms to compute length of whole build
            • Took 19ms to write truncated HTML of whole build
            • Took 11ms to write plain text of whole build
            • Took 340ms to write HTML of one node
            • Took 0ms to compute length of one node
            • Took 6ms to write truncated HTML of one node
            • Took 10ms to write plain text of one node

            With the patch (but not using remote log storage):

            • Took 3687ms to run the build
            • Took 866ms to write HTML of whole build
            • Took 127ms to compute length of whole build
            • Took 129ms to write truncated HTML of whole build
            • Took 162ms to write plain text of whole build
            • Took 630ms to write HTML of one node
            • Took 115ms to compute length of one node
            • Took 148ms to write truncated HTML of one node
            • Took 75ms to write plain text of one node

            The build itself seems to take longer, suggesting the need for optimization in AnnotatedLogAction.DecoratedTaskListener. (Though this is not a very realistic case, since it represents a step producing megabytes of output all at once, making things easy on copyLogs; in practice output would come incrementally, which the baseline version would probably spend more time copying.)

            As to the rendering of the log (if and when someone asks for it), the second set of numbers represent measurable load on the server, but a log this big seems like it would be rare in the field, and anyway the cost of rendering such log output in Chrome on the client side is much higher. In other words, the subjective overhead of viewing the build is mostly due to the browser, not any processing on the server. time curl -I … produces responses rather quickly. So I am less concerned about that.

            Show
            jglick Jesse Glick added a comment - Baseline (pre-patch) performance for a build with one ( echo ) step producing a million lines of output on my laptop: Took 1791ms to run the build Took 437ms to write HTML of whole build Took 0ms to compute length of whole build Took 19ms to write truncated HTML of whole build Took 11ms to write plain text of whole build Took 340ms to write HTML of one node Took 0ms to compute length of one node Took 6ms to write truncated HTML of one node Took 10ms to write plain text of one node With the patch (but not using remote log storage): Took 3687ms to run the build Took 866ms to write HTML of whole build Took 127ms to compute length of whole build Took 129ms to write truncated HTML of whole build Took 162ms to write plain text of whole build Took 630ms to write HTML of one node Took 115ms to compute length of one node Took 148ms to write truncated HTML of one node Took 75ms to write plain text of one node The build itself seems to take longer, suggesting the need for optimization in AnnotatedLogAction.DecoratedTaskListener . (Though this is not a very realistic case, since it represents a step producing megabytes of output all at once, making things easy on copyLogs ; in practice output would come incrementally, which the baseline version would probably spend more time copying.) As to the rendering of the log (if and when someone asks for it), the second set of numbers represent measurable load on the server, but a log this big seems like it would be rare in the field, and anyway the cost of rendering such log output in Chrome on the client side is much higher. In other words, the subjective overhead of viewing the build is mostly due to the browser, not any processing on the server. time curl -I … produces responses rather quickly. So I am less concerned about that.
            Hide
            jglick Jesse Glick added a comment -

            Did some optimizations; down to:

            • Took 2525ms to run the build
            • Took 761ms to write HTML of whole build
            • Took 133ms to compute length of whole build
            • Took 86ms to write truncated HTML of whole build
            • Took 111ms to write plain text of whole build
            • Took 379ms to write HTML of one node
            • Took 57ms to compute length of one node
            • Took 64ms to write truncated HTML of one node
            • Took 62ms to write plain text of one node
            Show
            jglick Jesse Glick added a comment - Did some optimizations; down to: Took 2525ms to run the build Took 761ms to write HTML of whole build Took 133ms to compute length of whole build Took 86ms to write truncated HTML of whole build Took 111ms to write plain text of whole build Took 379ms to write HTML of one node Took 57ms to compute length of one node Took 64ms to write truncated HTML of one node Took 62ms to write plain text of one node
            Hide
            ssbarnea Sorin Sbarnea added a comment -

            Jesse Glick Any updates on this? It seems that we hit that one and TBH I don't find some performance degradation as a read problem, if that is fixing a real bug.

            Now we had production downtimes due to https://issues.jenkins-ci.org/browse/JENKINS-37575 (which depends on this). This would count as more serious than some performance degradation.

            Show
            ssbarnea Sorin Sbarnea added a comment - Jesse Glick Any updates on this? It seems that we hit that one and TBH I don't find some performance degradation as a read problem, if that is fixing a real bug. Now we had production downtimes due to https://issues.jenkins-ci.org/browse/JENKINS-37575  (which depends on this). This would count as more serious than some performance degradation.
            Hide
            jglick Jesse Glick added a comment -

            Ryan Campbell asked for this to be put on hold pending some performance testing. (More to prove that it provides some benefit rather than proving it does not cause some regression, though you never know until you try.) Indeed there are a handful of open bugs in areas of code which are rewritten by this set of PRs. I would like to revisit it.

            Show
            jglick Jesse Glick added a comment - Ryan Campbell asked for this to be put on hold pending some performance testing. (More to prove that it provides some benefit rather than proving it does not cause some regression , though you never know until you try.) Indeed there are a handful of open bugs in areas of code which are rewritten by this set of PRs. I would like to revisit it.
            Hide
            michaelneale Michael Neale added a comment -

            Discussed today - Sam Van Oort has a scalability test suite WIP - which is a precondition to this (ie we need to be able to measure/prove performance before and after this set of changes). My main concern with the implementation that I have seen is the speed of fetching a log for a specific step (be that in a single file, or an external store). I expect Sam Van Oort or Andrew Bayer may pick this up.

            Show
            michaelneale Michael Neale added a comment - Discussed today - Sam Van Oort has a scalability test suite WIP - which is a precondition to this (ie we need to be able to measure/prove performance before and after this set of changes). My main concern with the implementation that I have seen is the speed of fetching a log for a specific step (be that in a single file, or an external store). I expect Sam Van Oort or Andrew Bayer may pick this up.
            Hide
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            pom.xml
            src/test/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImplTest.java
            http://jenkins-ci.org/commit/workflow-support-plugin/f7777331bfda5ad7694451f748a3d9e80859c2c2
            Log:
            Seems to also allow logging from steps which run a block first, though the logging does come out of logical order (which JENKINS-38381 would address).

            Show
            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: pom.xml src/test/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImplTest.java http://jenkins-ci.org/commit/workflow-support-plugin/f7777331bfda5ad7694451f748a3d9e80859c2c2 Log: Seems to also allow logging from steps which run a block first, though the logging does come out of logical order (which JENKINS-38381 would address).
            Hide
            svanoort Sam Van Oort added a comment - - edited

            Consider it picked up as part of the second batch of storage work – I plan to complete the storage and durability modes first (along with some polish on the scalability testing work). Edit: the reason for that is that the APIs are needed for downstream work.

            Additional comment: profiling found that copyLogs seems to introduce a large amount of the garbage generation and GC load in running trivial pipelines.

            • Specifically due to getLogText from LogActionImpl and WorkflowRun.writeRawLogTo
            • Generally due to creating fresh byte arrays and buffers for I/O rather than reusing single ones, but also due to the large number of File and Stream objects created. That could be fairly trivially resolved by using the same buffer - since copyLogs is guarded by synchronization on the WorkflowRun.completed object, single instances could be reused safely (if need implementations of the LargeText/AnnotatedLargeText logic were used). This would add a few kB to memory overheads for pipeline but greatly reduces GC load – meaning probably less heap needed overall.
            Show
            svanoort Sam Van Oort added a comment - - edited Consider it picked up as part of the second batch of storage work – I plan to complete the storage and durability modes first (along with some polish on the scalability testing work). Edit: the reason for that is that the APIs are needed for downstream work. Additional comment: profiling found that copyLogs seems to introduce a large amount of the garbage generation and GC load in running trivial pipelines. Specifically due to getLogText from LogActionImpl and WorkflowRun.writeRawLogTo Generally due to creating fresh byte arrays and buffers for I/O rather than reusing single ones, but also due to the large number of File and Stream objects created. That could be fairly trivially resolved by using the same buffer - since copyLogs is guarded by synchronization on the WorkflowRun.completed object, single instances could be reused safely (if need implementations of the LargeText/AnnotatedLargeText logic were used). This would add a few kB to memory overheads for pipeline but greatly reduces GC load – meaning probably less heap needed overall.
            Hide
            drulli Ulli Hafner added a comment -

            I wrote another test case that reproduces the problem of JENKINS-32191. Makes parsing of the console log with the warnings plug-in on in pipelines quite a lottery.

            Show
            drulli Ulli Hafner added a comment - I wrote  another test case that reproduces the problem of JENKINS-32191 . Makes parsing of the console log with the warnings plug-in on in pipelines quite a lottery.
            Hide
            jglick Jesse Glick added a comment -

            workflow-durable-task-step #65 covers everything I have updated recently: there is now a set of independently mergeable PRs for:

            • push rather than pull of durable task output
            • proper handling of Unicode (JENKINS-31096)
            • both of those together (so they could be merged in either order)

            workflow-job #27 is now downstream of those changes, and continues to encapsulate the broader rewrite:

            • single log sink for all Pipeline steps
            • SPI to allow that sink be an external service
            • redesign of step console annotations for a more usable display in the classic UI (this requires the first item)
            Show
            jglick Jesse Glick added a comment - workflow-durable-task-step #65 covers everything I have updated recently: there is now a set of independently mergeable PRs for: push rather than pull of durable task output proper handling of Unicode ( JENKINS-31096 ) both of those together (so they could be merged in either order) workflow-job #27 is now downstream of those changes, and continues to encapsulate the broader rewrite: single log sink for all Pipeline steps SPI to allow that sink be an external service redesign of step console annotations for a more usable display in the classic UI (this requires the first item)
            Hide
            shahmishal mishal shah added a comment -

            Jesse Glick Do you know when this will be released? We are also seeing infinite log issue.

            Show
            shahmishal mishal shah added a comment - Jesse Glick Do you know when this will be released? We are also seeing infinite log issue.
            Hide
            victorbjelkholm Victor Bjelkholm added a comment -

            Any updates on this issue? Currently, this is slowing down builds and would be great to have fixed, or at least marked as "stalled" || "blocked" if that's the situation.

            Show
            victorbjelkholm Victor Bjelkholm added a comment - Any updates on this issue? Currently, this is slowing down builds and would be great to have fixed, or at least marked as "stalled" || "blocked" if that's the situation.
            Hide
            jglick Jesse Glick added a comment -

            PRs refreshed.

            Show
            jglick Jesse Glick added a comment - PRs refreshed.
            Hide
            stevenfoster Steven Foster added a comment -

            Nice, would this make any use of the work on artifact-manager-s3-plugin for the external storage part?

            Show
            stevenfoster Steven Foster added a comment - Nice, would this make any use of the work on artifact-manager-s3-plugin for the external storage part?
            Hide
            jglick Jesse Glick added a comment -

            No, external artifact management is unrelated.

            Show
            jglick Jesse Glick added a comment - No, external artifact management is unrelated.
            Hide
            jglick Jesse Glick added a comment -

            Yesterday I redesigned the API to more clearly separate layers of responsibility and allow cloud-based implementations to perform better. I have abandoned the logstash PoC implementation.

            Show
            jglick Jesse Glick added a comment - Yesterday I redesigned the API to more clearly separate layers of responsibility and allow cloud-based implementations to perform better. I have abandoned the logstash PoC implementation.
            Hide
            webrat Andreas Sieferlinger added a comment -

            any update on this? its currently marked as in progress.

            Show
            webrat Andreas Sieferlinger added a comment - any update on this? its currently marked as in progress.
            Hide
            jglick Jesse Glick added a comment -

            Hoping for some review activity soon.

            Show
            jglick Jesse Glick added a comment - Hoping for some review activity soon.
            Hide
            jglick Jesse Glick added a comment -

            I filed a JEP submission for this.

            Show
            jglick Jesse Glick added a comment - I filed a JEP submission for this.
            Hide
            jglick Jesse Glick added a comment -

            Released in beta.

            Show
            jglick Jesse Glick added a comment - Released in beta.

              People

              • Assignee:
                jglick Jesse Glick
                Reporter:
                jglick Jesse Glick
              • Votes:
                35 Vote for this issue
                Watchers:
                81 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: