JENKINS-60673

Swarm agents sometimes disconnect and stay that way forever

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Component/s: swarm-plugin
    • Labels: None

      Description

      For context, our Jenkins master server JVM is dedicated to dispatching, and some of the infra jobs are handled by an agent running on the same machine under a different account. Some time ago this agent was switched from SSH to Swarm. After some uptime it tends to disconnect, or fails to connect at all when the master is too busy.

      We had an outage that started a few days ago and was only noticed now, when we came back to work. This was the last logged line:

       Jan 05 08:59:37 jenkins2 bash[15117]: INFO: Failed to send back a reply to the request hudson.remoting.Request$2@525facc1: hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
      

      The agent JVM continued running, so neither any logic inside the agent nor systemd restarted it to reconnect and restore the service. Because the JVM appears to be alive, the OS restarts nothing, and the agent simply disappears from the master since there is no connection.
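A possible workaround for the situation described above (agent JVM alive, but no connection to the master) is an external watchdog that asks the master whether the agent is online and restarts the service if it is not. The following is only a minimal sketch, not something taken from this ticket. It assumes that the Jenkins remote API endpoint `/computer/<name>/api/json` is readable without authentication and that its `offline` field reflects the agent state. The function names, the agent name, and the systemd unit name are hypothetical placeholders.

```python
import json
import subprocess
import urllib.request
from urllib.parse import quote


def computer_api_url(jenkins_url: str, agent_name: str) -> str:
    """Build the remote API URL for one agent (node) on the master."""
    return f"{jenkins_url.rstrip('/')}/computer/{quote(agent_name)}/api/json"


def agent_is_offline(payload: str) -> bool:
    """Return True when the JSON reply reports the agent as offline."""
    data = json.loads(payload)
    # A missing "offline" field is treated as offline (assumption of this sketch).
    return bool(data.get("offline", True))


def fetch_json(url: str, timeout: float = 30.0) -> str:
    """Fetch the reply body; raises if the master is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def restart_unit(unit: str) -> None:
    """Restart the (hypothetical) systemd unit that runs the Swarm client."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def check_and_restart(jenkins_url, agent_name, unit,
                      fetch=fetch_json, restart=restart_unit) -> bool:
    """Restart the unit if the master reports the agent offline.

    Returns True when a restart was issued. If the master itself cannot be
    reached, the exception from fetch() propagates and nothing is restarted.
    """
    payload = fetch(computer_api_url(jenkins_url, agent_name))
    if agent_is_offline(payload):
        restart(unit)
        return True
    return False
```

Such a check could be run periodically, for example from a systemd timer or cron. A master that requires authentication would need credentials added to the request, which this sketch leaves out.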

      In fact, the preceding lines point to insufficient memory. I have no idea how much we should throw at it: with whatever settings we tried, it works for days or weeks and then suddenly does not. At the moment we have java -Xms64m -Xmx512m for the agent.

      Jan 05 08:59:37 jenkins2 bash[15117]: at java.lang.Thread.run(Thread.java:748)
      Jan 05 08:59:37 jenkins2 bash[15117]: Caused by: java.lang.OutOfMemoryError: Java heap space
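Since the log shows an OutOfMemoryError in a JVM that nevertheless kept running, one option is to make the JVM exit on the first such error so that the service manager can restart it. The sketch below only assembles the JVM arguments. It assumes a HotSpot JVM that supports `-XX:+ExitOnOutOfMemoryError` (available since JDK 8u92), and the function name is hypothetical. The heap sizes are the ones quoted in the description, not a recommendation.

```python
def agent_jvm_args(xms: str = "64m", xmx: str = "512m",
                   exit_on_oom: bool = True) -> list:
    """Return JVM arguments for the agent process.

    With exit_on_oom=True the JVM terminates on the first OutOfMemoryError
    instead of lingering in a half-working state.
    """
    args = [f"-Xms{xms}", f"-Xmx{xmx}"]
    if exit_on_oom:
        args.append("-XX:+ExitOnOutOfMemoryError")
    return args


# Example: the settings from the description plus the exit-on-OOM flag.
print(" ".join(["java", *agent_jvm_args()]))
```

For this to help, the systemd unit that starts the agent would also need a restart policy such as `Restart=always`; otherwise the JVM would exit and simply stay down.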
      

          Activity

          basil Basil Crow added a comment -

          Hey Jim Klimov, thanks for reporting this. While there isn't enough information above for me to be able to determine the root cause, may I suggest the following actions to help debug the issue:

          • Try starting the Swarm client with more verbose logging. This page provides an example of a verbose logging.properties file that logs as much as possible. Perhaps the additional logs will shed more light on what the Swarm client was doing at the time of the failure.
          • Regarding the java.lang.OutOfMemoryError: Java heap space error, you might try enabling the -XX:+HeapDumpOnOutOfMemoryError JVM option and then analyzing the heap dump using standard JVM memory analysis techniques to determine the source of the memory utilization.
          • Regarding the Protocol stack cannot write data anymore. It is not open for write error, this emanates from the underlying Jenkins Remoting library (GitHub), maintained by Jeff Thompson. It might be worth reading the source code corresponding to the error and trying to work your way backwards to understand how that pathological state was reached. This is easier to do after you have enabled more verbose logs. If you believe there is a Remoting issue, you can file one in Jira under the remoting component to be investigated by Jeff.
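The first two suggestions above can be combined into a single launch command. The sketch below is an illustration only: the logging configuration is a generic java.util.logging setup that logs everything to the console, not the file from the linked page, and the function name, file paths, and client arguments are hypothetical placeholders. The JVM options used (`-XX:+HeapDumpOnOutOfMemoryError`, `-XX:HeapDumpPath`, and the `java.util.logging.config.file` system property) are standard.

```python
# Generic java.util.logging configuration that logs at the most verbose level.
VERBOSE_LOGGING_PROPERTIES = """\
handlers=java.util.logging.ConsoleHandler
.level=ALL
java.util.logging.ConsoleHandler.level=ALL
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
"""


def diagnostic_java_command(client_jar, logging_config, heap_dump_dir,
                            client_args=()):
    """Build a java command line with heap dumps and verbose logging enabled.

    client_args is passed through unchanged after the jar; which options the
    Swarm client needs depends on the installation and is not covered here.
    """
    return [
        "java",
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={heap_dump_dir}",
        f"-Djava.util.logging.config.file={logging_config}",
        "-jar",
        str(client_jar),
        *client_args,
    ]


def write_logging_config(path) -> None:
    """Write the verbose logging configuration to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(VERBOSE_LOGGING_PROPERTIES)
```

The heap dump directory needs enough free space for a file roughly the size of the configured maximum heap, and logging at level ALL can be very noisy, so it is probably best enabled only while investigating.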
          basil Basil Crow added a comment -

          As was mentioned in the previous comment, the out of memory error may have been caused by any Jenkins plugin that runs code agent-side, not necessarily the Swarm client. You might try enabling the -XX:+HeapDumpOnOutOfMemoryError JVM option and then analyzing the heap dump using standard JVM memory analysis techniques to determine the source of the memory utilization. I am closing this issue as "Cannot Reproduce", but if you see evidence pointing to a memory leak in the Swarm client, please open a new issue with detailed steps to reproduce.

          jimklimov Jim Klimov added a comment -

          Thanks for the suggestions. For the past several months (after bumping the Xmx settings a few times, I suppose) the issue has not reappeared, so at the moment I have no heap dumps to post...


            People

            • Assignee: Unassigned
            • Reporter: jimklimov Jim Klimov
            • Votes: 0
            • Watchers: 2
