  Jenkins / JENKINS-3890

Hudson fails to launch the slave agent a second time, even though its availability is set to take the node offline when idle.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Platform: All, OS: Linux

      Description

      I had this working with older versions of Hudson; it broke on upgrade.

      I can manually launch the agent, and it is configured as such.

      Launch method = ssh
      Availability = Take node off line when idle

      Then, when the agent goes offline, Hudson fails to restart it. I then have to
      manually click "launch slave agent" to restart it. Once this is done, the
      agent runs until it is idle, when it goes offline again and never comes
      back online until I click the "launch slave agent" button.

      The agents are running on Scientific Linux (Red Hat binary compatible OS) 32 and
      64 bit, versions 4 and 5.

      The following is a typical log file

      [06/16/09 18:43:32] [SSH] Connection closed.

      <lots of gibberish characters>

      channel stopped
      [06/16/09 18:43:32] slave agent was terminated
      hudson.remoting.Channel$OrderlyShutdown
      at hudson.remoting.Channel$CloseCommand.execute(Channel.java:623)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:758)
      Caused by: Command close created at
      at hudson.remoting.Command.<init>(Command.java:47)
      at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:619)
      at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:619)
      at hudson.remoting.Channel.close(Channel.java:663)
      at hudson.remoting.Channel$CloseCommand.execute(Channel.java:622)
      ... 1 more

      This is 100% repeatable. We are running Hudson ver. 1.310, and this bug has been
      around for quite a few releases, maybe 3-4 months.

          Activity

          mirzmaster mirzmaster added a comment -

          I can confirm the occurrence of this issue. The agents are running on Ubuntu
          Hardy, Intrepid and Jaunty, all 32-bit.

          The SSH plugin appeared to be working just fine up to Hudson 1.304. A
          subsequent release appears to have caused this regression (perhaps coinciding
          with the inclusion of the SSH plugin as a default Hudson plugin).

          mindless Alan Harder added a comment -

          add cc

          mindless Alan Harder added a comment -
          Issue 4302 has been marked as a duplicate of this issue.
          mindless Alan Harder added a comment -

          From #4302:
          "We discovered this in Hudson v1.309. When a slave is configured with an idle
          delay, it eventually shuts down due to inactivity. When this happens, the
          Hudson master seems to be unable to distinguish this condition from a genuine
          failure, and marks the slave as offline. If the slave is also configured with
          an on-demand delay, then Hudson will attempt to start the slave if its job queue
          is non-empty for the specified time, but will fail because it thinks the slave
          has failed."

          mindless Alan Harder added a comment -

          Working on this. The problem is that SlaveComputer.lastConnectActivity never gets
          reset to null, so when Hudson tries to relaunch the node it thinks it is already
          trying to connect. (The "Launch now" button, on the other hand, does a
          "forceReconnect", so it launches even though it thinks it is already trying to
          connect.)
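
The guard described in this comment can be illustrated with a minimal sketch. This is not the actual SlaveComputer code: the class ReconnectGuardSketch, its fields, and both connect methods are hypothetical stand-ins, and the launch is assumed to complete synchronously. It only shows why a check on lastConnectActivity != null blocks every non-forced relaunch after the first, while a check on isConnecting() does not.

```java
/** Hypothetical, simplified model of the reconnect guard; not the real SlaveComputer. */
public class ReconnectGuardSketch {
    // Timestamp of the last connect activity. As in the bug report, it is never reset to null.
    private Long lastConnectActivity = null;
    // True only while a launch is actually in flight.
    private boolean connecting = false;

    public boolean isConnecting() {
        return connecting;
    }

    /** Buggy guard: any earlier connect attempt looks like one still in progress. */
    public boolean connectBuggy(boolean forceReconnect) {
        if (lastConnectActivity != null && !forceReconnect) {
            return false; // skipped: "already connecting"
        }
        return launch();
    }

    /** Fixed guard: skip only when a connect attempt is really in progress. */
    public boolean connectFixed(boolean forceReconnect) {
        if (isConnecting() && !forceReconnect) {
            return false;
        }
        return launch();
    }

    // Assumption: the launch finishes synchronously, so connecting is false again on return.
    private boolean launch() {
        connecting = true;
        lastConnectActivity = System.currentTimeMillis();
        connecting = false;
        return true;
    }
}
```

With the buggy guard, the first connectBuggy(false) launches, every later connectBuggy(false) is skipped, and only connectBuggy(true) (the manual button) gets through, which matches the reported behaviour.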

          scm_issue_link SCM/JIRA link daemon added a comment -

          Code changed in hudson
          User: mindless
          Path:
          trunk/hudson/main/core/src/main/java/hudson/slaves/SlaveComputer.java
          trunk/www/changelog.html
          http://fisheye4.cenqua.com/changelog/hudson/?cs=23875
          Log:
          [FIXED JENKINS-3890] Fix in-demand RetentionStrategy and more offline-slave fixes.

          • Use isConnecting() instead of lastConnectActivity!=null in checking forceReconnect
            param. The var never resets to null, so this prevented forceReconnect==false from ever
            reconnecting (which broke in-demand strategy since 1.302, when it started using false)
          • Only set ChannelTermination offline cause when the exception param is non-null.
            On orderly shutdown this was replacing the real cause of going offline.
          • In setNode try to delegate to RetentionStrategy rather than calling connect directly.
            Reconfiguring a node was launching slave even if it wasn't needed per its strategy.
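
The second item in the log above (only recording a channel-termination cause when an exception is present) can be sketched as follows. This is an illustration under stated assumptions, not the committed code: OfflineCauseSketch, takeOffline, onChannelClosed, and the string-valued cause are all hypothetical names chosen for the example.

```java
import java.io.IOException;

/** Hypothetical sketch of preserving the real offline cause on an orderly shutdown. */
public class OfflineCauseSketch {
    private String offlineCause;

    /** Records why the node was deliberately taken offline, e.g. by an idle strategy. */
    public void takeOffline(String reason) {
        offlineCause = reason;
    }

    /**
     * Called when the channel closes. A null cause stands for an orderly shutdown,
     * in which case the previously recorded reason is left untouched.
     */
    public void onChannelClosed(IOException cause) {
        if (cause != null) {
            offlineCause = "ChannelTermination: " + cause.getMessage();
        }
    }

    public String getOfflineCause() {
        return offlineCause;
    }
}
```

In this sketch, an idle shutdown followed by onChannelClosed(null) keeps the idle reason, whereas a real failure replaces it with the termination cause.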

            People

            • Assignee:
              mindless Alan Harder
              Reporter:
              owensynge owensynge
            • Votes:
              0
              Watchers:
              2
