Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-60673

Swarm agents sometimes disconnect and stay that way forever

XMLWordPrintable

    • Icon: Improvement Improvement
    • Resolution: Cannot Reproduce
    • Icon: Major Major
    • swarm-plugin
    • None

      For context, our Jenkins master server JVM is dedicated to dispatching, with some of the infra jobs handled by an agent running on the same machine in a different account. Some time ago this agent was remade from SSH to Swarm, and over some uptime it tends to disconnect, and or not-connect when the master is too busy

      We had an outage starting a few days ago that was only noticed now as we came back to work, e.g. this being the last logged line:

       Jan 05 08:59:37 jenkins2 bash[15117]: INFO: Failed to send back a reply to the request hudson.remoting.Request$2@525facc1: hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
      

      The agent JVM continued running, so neither some logic inside the agent, nor systemd, would restart it to actually reconnect and keep the real service provided. Here the JVM seems to be alive, so nothing is restarted by the OS, and the agent just disappears from master since there is no connection.

      In fact, preceding lines point to insufficient memory (and I have no idea how much we should throw at it, because with whatever settings we tried it works for days and weeks and then suddenly it does not; at the moment we have java -Xms64m -Xmx512m for the agent).

      Jan 05 08:59:37 jenkins2 bash[15117]: at java.lang.Thread.run(Thread.java:748)
      Jan 05 08:59:37 jenkins2 bash[15117]: Caused by: java.lang.OutOfMemoryError: Java heap space
      

            Unassigned Unassigned
            jimklimov Jim Klimov
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: