  1. Jenkins
  2. JENKINS-49097

Ssh-agent-plugin doesn't kill ssh-agent in top-level matrix jobs

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: ssh-agent-plugin
    • Labels:
      None
    • Environment:
      Jenkins 2.32.3
      ssh-agent-plugin 1.15

      Description

      Ssh-agent-plugin starts, but does not kill ssh-agent processes in top-level matrix jobs.

      00:00:00.052 [ssh-agent] Looking for ssh-agent implementation...
      00:00:00.167 [ssh-agent]   Exec ssh-agent (binary ssh-agent on a remote machine)
      00:00:00.189 $ ssh-agent
      00:00:00.278 SSH_AUTH_SOCK=/tmp/ssh-T6i78P9tKd5A/agent.28069
      00:00:00.278 SSH_AGENT_PID=28071
      00:00:00.278 [ssh-agent] Started.
      00:00:00.389 $ ssh-add /home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key
      00:00:00.408 Identity added: /home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key (/home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key)
      00:00:00.520 [ssh-agent] Using credentials tcwg-buildslave (buildslave for TCWG machines)
      00:00:00.542 Set build name.
      00:00:00.543 Triggering TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build
      00:00:05.545 Configuration TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build is still in the queue: Waiting for next available executor on tcwg-x86_64-build
      06:43:08.741 TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build completed with result FAILURE
      06:43:08.902 Set build name.
      06:43:08.905 Unrecognized macro 'branch' in '${branch} #399'
      06:43:08.907 Finished: FAILURE
      

      Since the top-level matrix job only spawns child jobs, it does not need access to ssh-agent keys (note that SCM clones/checkouts use their own interface to ssh-agent-plugin).  Therefore ssh-agent-plugin could either not start ssh-agent for top-level matrix jobs at all, or terminate it during cleanup.  It is not clear why the existing cleanup code does not trigger for top-level matrix jobs.
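
      The two options above can be sketched in plain Java. This is a simplified model under stated assumptions, not the plugin's code: `BuildKind`, `FakeAgent`, and `runWithAgent` are hypothetical names invented for illustration, and a real fix would have to hook into the Jenkins build-wrapper lifecycle instead.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: every name in this class is hypothetical.
final class AgentLifecycleSketch {

    /** Hypothetical stand-in for the kind of build being wrapped. */
    enum BuildKind { MATRIX_PARENT, MATRIX_CHILD, FREESTYLE }

    /** Hypothetical stand-in for a running ssh-agent process. */
    static final class FakeAgent implements AutoCloseable {
        private boolean running = true;

        boolean isRunning() {
            return running;
        }

        @Override
        public void close() {
            running = false; // models killing the process named by SSH_AGENT_PID
        }
    }

    /** Every agent started so far, kept so callers can check for leaks. */
    static final List<FakeAgent> started = new ArrayList<>();

    /**
     * Option 1: a matrix parent only triggers child builds, so no agent is started.
     * Option 2: for every other kind, try-with-resources stops the agent
     * even when the build body throws.
     */
    static void runWithAgent(BuildKind kind, Runnable buildBody) {
        if (kind == BuildKind.MATRIX_PARENT) {
            buildBody.run();
            return;
        }
        try (FakeAgent agent = startAgent()) {
            buildBody.run();
        }
    }

    private static FakeAgent startAgent() {
        FakeAgent agent = new FakeAgent();
        started.add(agent);
        return agent;
    }

    /** Number of agents that were started but never stopped. */
    static long leakedAgents() {
        return started.stream().filter(FakeAgent::isRunning).count();
    }
}
```

      In this model a failing child build still leaves `leakedAgents()` at zero, and a parent build never adds an entry to `started`. Whether the actual plugin can tell a matrix parent apart at the point where it starts the agent is not something this report establishes.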

      This issue causes thousands of ssh-agent processes to accumulate on busy systems.  To clean up these processes, one has to wait until the system is idle to avoid killing the few active ssh-agent processes.  Busy systems, unfortunately, are rarely idle.
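
      One way to avoid waiting for an idle system would be to kill only the agents that no running build owns. The sketch below is an assumption-laden illustration, not a tested cleanup tool: `StaleAgentFinder` and its methods are hypothetical, and it assumes the set of active PIDs can be gathered separately, for example from the `SSH_AGENT_PID=...` lines in the console logs of builds that are still running.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: reports candidate PIDs, it does not kill anything.
final class StaleAgentFinder {

    /** Snapshot of visible processes: PID -> executable path, where the OS exposes it. */
    static Map<Long, String> snapshotProcesses() {
        Map<Long, String> result = new TreeMap<>();
        ProcessHandle.allProcesses().forEach(p ->
            p.info().command().ifPresent(cmd -> result.put(p.pid(), cmd)));
        return result;
    }

    /**
     * Returns the PIDs of ssh-agent processes that are not in activeAgentPids,
     * in ascending order. activeAgentPids is an assumed input (see lead-in).
     */
    static List<Long> staleAgentPids(Map<Long, String> processes, Set<Long> activeAgentPids) {
        return processes.entrySet().stream()
            .filter(e -> isSshAgent(e.getValue()))
            .map(Map.Entry::getKey)
            .filter(pid -> !activeAgentPids.contains(pid))
            .sorted()
            .collect(Collectors.toList());
    }

    /** Matches on the executable's base name only. */
    private static boolean isSshAgent(String command) {
        String name = command.substring(command.lastIndexOf('/') + 1);
        return name.equals("ssh-agent");
    }
}
```

      A caller could pass `StaleAgentFinder.snapshotProcesses()` together with the active PIDs and review the result before terminating anything, for instance via `ProcessHandle.of(pid)`. There is an inherent race: a build may start a new agent between taking the snapshot and collecting the active PIDs, so the active set should be collected last or the check repeated.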


          Activity

          maxim_kuvyrkov Maxim Kuvyrkov created issue
          maxim_kuvyrkov Maxim Kuvyrkov made changes
          Field Original Value New Value
          Description When a job with the {{SSHAgentBuildWrapper}} enabled fails very early (for instance during SCM checkout), an {{ssh-agent}} process is left behind. The issue is that the {{SSHAgentEnvironment}} is instantiated very early (from {{preCheckout}}), but its {{tearDown}} method will only be called if execution reaches {{BuildExecution.doRun}} (which comes after the SCM checkout phase in {{AbstractBuildExecution.run}}).

          Before {{ssh-agent-plugin 1.14}}, there was no {{ssh-agent}} process, so the problem of an {{SSHAgentEnvironment}} not being torn down was less visible (although there was probably already a less obvious resource leak, with {{AgentServer}} not being properly closed).

          This kind of issue, where an {{Environment}} is not properly torn down, can happen whenever it is instantiated not from {{BuildWrapper.setUp}} but from an earlier phase (such as {{BuildWrapper.preCheckout}} or {{RunListener.setUpEnvironment}}). As such, this may be something that should be fixed in core (perhaps in {{AbstractBuildExecution.run}}) rather than specifically in the {{ssh-agent-plugin}}; I don't know.

          I've written and attached a "generic workaround" {{RunListener}}, which tries to detect this situation from {{onCompleted}} and calls {{tearDown}} on every {{Environment}} for which that has not been done already. It's not something I propose for inclusion, but rather some code to exhibit the issue. If an ssh-agent-specific fix is desirable, then a similar approach might be an option (but targeting {{SSHAgentEnvironment}} only).

            People

            • Assignee:
              Unassigned
            • Reporter:
              maxim_kuvyrkov Maxim Kuvyrkov
            • Votes:
              1
            • Watchers:
              4
