Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-72286

ClosedChannelException occurring immediately after pod detected as running

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Major Major
    • kubernetes-plugin
    • None
    • Jenkins 2.414.3, Kubernetes plugin
      Version4054.v2da_8e2794884, Agent jenkins/inbound-agent:3181.v78543a_987053-1-alpine-jdk17

      We're running our CI pipelines on Kubernetes where each pull request can spawn up to 40 parallel stages.  On occasion, we would experience a build failure because of a ClosedChannelException occurring just after the controller recognizes that the agent's pod is running.  Recently, we've experienced a 4x increase in these types of failures – almost always associated during the time of a scale-up event in the cluster.  From the controller:
       

      2023-11-03 16:13:40.442+0000 [id=999274]    INFO    hudson.remoting.Request$2#run: Failed to send back a reply to the request RPCRequest:hudson.remoting.RemoteClassLoader$IClassLoader.fetch3[java.lang.String](2): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@121ef5f4:JNLP4-connect connection from ip-172-16-23-139.us-west-2.compute.internal/172.16.23.139:57914": channel is already closed
      2023-11-03 16:13:41.593+0000 [id=999274]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-main-4924-hpzfn-njbgn-zrsv8
      2023-11-03 16:13:43.506+0000 [id=999274]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-main-4924-hpzfn-njbgn-zrsv8
      2023-11-03 16:13:53.850+0000 [id=999274]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-main-4925-pcbsj-wm756-gdcsl
      2023-11-03 16:15:11.482+0000 [id=999825]    INFO    o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4924-qms8v-zmxbl-xgs5s], container: [clumio-base][230 ms]
      2023-11-03 16:15:23.316+0000 [id=999825]    INFO    o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-96lxr-n5s81-f83qm], container: [clumio-base][230 ms]
      2023-11-03 16:15:25.021+0000 [id=999274]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-main-4925-pcbsj-wm756-gdcsl
      2023-11-03 16:15:37.815+0000 [id=999825]    INFO    o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-bjj6c-wfdcl-rn53j], container: [clumio-base][239 ms]
      2023-11-03 16:15:53.128+0000 [id=90]    INFO    hudson.slaves.NodeProvisioner#update: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t provisioning successfully completed. We have now 111 computer(s)
      2023-11-03 16:15:53.274+0000 [id=999825]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:16:01.278+0000 [id=999274]    INFO    o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [githubjira-cdf-pr-42-3-j24cp-d0x9w-6x4n0], container: [clumio-base][229 ms]
      2023-11-03 16:16:13.049+0000 [id=1000278]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #100276 from /172.16.23.81:51200
      2023-11-03 16:16:19.599+0000 [id=999274]    INFO    o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-ql51w-h2r2w-8mnff], container: [clumio-base][243 ms]
      2023-11-03 16:16:52.713+0000 [id=999825]    INFO    o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:16:53.006+0000 [id=999274]    INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/172.16.26.111:50000 remote=ip-172-16-23-81.us-west-2.compute.internal/172.16.23.81:51200]] / Computer.threadPoolForRemoting [#73270] for github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t terminated: java.nio.channels.ClosedChannelException
      2023-11-03 16:17:41.779+0000 [id=999174]    INFO    o.c.j.p.k.p.r.Reaper$TerminateAgentOnContainerTerminated#lambda$onEvent$1: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Container clumio-base was just terminated, so removing the corresponding Jenkins agent
      2023-11-03 16:17:41.779+0000 [id=999174]    INFO    o.c.j.p.k.p.r.Reaper$TerminateAgentOnContainerTerminated#lambda$onEvent$1: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Container jnlp was just terminated, so removing the corresponding Jenkins agent
      2023-11-03 16:17:41.808+0000 [id=999174]    INFO    o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:17:41.813+0000 [id=999174]    WARNING    o.c.j.p.k.KubernetesSlave#deleteSlavePod: Failed to delete pod for agent jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t: not found
      2023-11-03 16:17:41.814+0000 [id=999174]    INFO    o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:17:41.819+0000 [id=999174]    INFO    o.c.j.p.k.p.r.Reaper$TerminateAgentOnPodFailed#onEvent: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Pod just failed. Removing the corresponding Jenkins agent. Reason: null, Message: null
      2023-11-03 16:17:41.835+0000 [id=999174]    INFO    o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:17:41.835+0000 [id=999174]    SEVERE    o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:17:41.836+0000 [id=999174]    INFO    hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t
      2023-11-03 16:22:04.573+0000 [id=1001374]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #100384 from /172.16.23.81:43584

       

      Java Runtime on Controller
       

      openjdk 17.0.8.1 2023-08-24 OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1) OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode)

       

      One other item of note is that our pipelines can't simply be restarted from the point of failure; they must undergo the entire build cycle when the job fails.  Therefore, a configurable retry w/ backoff for initial pod startup would significantly help address this issue.

            Unassigned Unassigned
            tonyclumio Anthony Cruz
            Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: