  1. Jenkins
  2. JENKINS-24926

Connection to slave constantly breaks, java process can only be kill -9-ed


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • core
    • None

      For approximately nine months we have had constant trouble with our master losing connectivity to our Mac OS X Mavericks slave (connected via SSH). After some time, the master can no longer communicate with the slave, and jobs fail with the following error message:

      Building remotely on MAC_OS_mavericks_64bit (macos mavericks java7)
      FATAL: channel is already closed
      hudson.remoting.ChannelClosedException: channel is already closed
      	at hudson.remoting.Channel.send(Channel.java:541)
      	at hudson.remoting.Request.call(Request.java:129)
      	at hudson.remoting.Channel.call(Channel.java:739)
      	at hudson.EnvVars.getRemote(EnvVars.java:404)
      	at hudson.model.Computer.getEnvironment(Computer.java:912)
      	at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
      	at hudson.model.Run.getEnvironment(Run.java:2221)
      	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:885)
      	at hudson.matrix.MatrixRun$MatrixRunExecution.decideWorkspace(MatrixRun.java:175)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:513)
      	at hudson.model.Run.execute(Run.java:1706)
      	at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:232)
      Caused by: java.io.IOException
      	at hudson.remoting.Channel.close(Channel.java:1027)
      	at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
      	at hudson.remoting.PingThread.ping(PingThread.java:120)
      	at hudson.remoting.PingThread.run(PingThread.java:81)
      Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
      	... 2 more
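
To make the timeout in the trace above easier to read, here is a minimal sketch that extracts the two numbers from the `TimeoutException` message and computes the gap between them. It assumes the numbers are milliseconds since the Unix epoch; the class and method names (`PingGap`, `gap`) are made up for illustration and are not part of Jenkins.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PingGap {
    // Matches the message format seen in the stack trace of this report.
    private static final Pattern PING =
        Pattern.compile("Ping started on (\\d+) hasn't completed at (\\d+)");

    /** Returns the time between ping start and the moment the timeout was reported. */
    static Duration gap(String message) {
        Matcher m = PING.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("not a ping timeout message: " + message);
        }
        // Assumption: both values are epoch milliseconds.
        Instant started = Instant.ofEpochMilli(Long.parseLong(m.group(1)));
        Instant reported = Instant.ofEpochMilli(Long.parseLong(m.group(2)));
        return Duration.between(started, reported);
    }

    public static void main(String[] args) {
        String msg = "java.util.concurrent.TimeoutException: "
            + "Ping started on 1412069451954 hasn't completed at 1412069691955";
        System.out.println(gap(msg).toMillis() + " ms"); // prints "240001 ms"
    }
}
```

Under that assumption, the ping went unanswered for about four minutes (240001 ms) before the channel was closed.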
      

      At some point, the slave is then marked as offline. Trying to reconnect does nothing: the log window stays empty, showing only the spinning loading animation, and no output is ever generated.

      We could not observe any issues with the underlying network connection. Every time I observe this error, ssh-ing to the slave as the jenkins user works without any problems.

      This also happens only for the Mavericks slave. All other slaves (Linux and Windows) work perfectly.

      What is extremely confusing is that once Jenkins has ended up in this condition, you cannot restart it cleanly. You first have to kill the java process with SIGKILL, even though it is apparently not completely stuck: everything apart from the Mavericks slave continues to work perfectly.
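
The "ask nicely first, then force" sequence described here can be sketched with the standard `ProcessHandle` API (Java 9 or later). This is only an illustration of the escalation pattern, not how Jenkins or its service scripts stop the process; `StopEscalation`, `stop`, and the grace period are hypothetical names and values. On Unix-like systems `destroy()` normally maps to SIGTERM and `destroyForcibly()` to SIGKILL, but the exact behaviour is implementation dependent.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StopEscalation {
    /**
     * Requests normal termination, waits up to graceSeconds, and only then
     * kills the process forcibly. Returns true if the forced kill was needed.
     */
    static boolean stop(ProcessHandle handle, long graceSeconds) {
        handle.destroy(); // normal termination request (typically SIGTERM)
        try {
            handle.onExit().get(graceSeconds, TimeUnit.SECONDS);
            return false; // exited within the grace period
        } catch (TimeoutException e) {
            handle.destroyForcibly(); // forced termination (typically SIGKILL)
            handle.onExit().join();   // wait until the process is really gone
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long pid = Long.parseLong(args[0]); // PID of the process to stop
        ProcessHandle handle = ProcessHandle.of(pid)
            .orElseThrow(() -> new IllegalArgumentException("no such process: " + pid));
        System.out.println(stop(handle, 30) ? "had to force kill" : "stopped cleanly");
    }
}
```

In the situation reported here, the first branch would never be reached: the process ignores the normal termination request and only the forced kill ends it.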

      The general Jenkins log file shows only that the jobs for checking disk space etc. also suffer from the connectivity issue:

      WARNING: Failed to monitor MAC_OS_mavericks_64bit for Architecture
      hudson.remoting.ChannelClosedException: channel is already closed
              at hudson.remoting.Channel.send(Channel.java:541)
              at hudson.remoting.Request.callAsync(Request.java:208)
              at hudson.remoting.Channel.callAsync(Channel.java:766)
              at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
              at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)
      Caused by: java.io.IOException
              at hudson.remoting.Channel.close(Channel.java:1027)
              at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
              at hudson.remoting.PingThread.ping(PingThread.java:120)
              at hudson.remoting.PingThread.run(PingThread.java:81)
      Caused by: java.util.concurrent.TimeoutException: Ping started on 1412069451954 hasn't completed at 1412069691955
              ... 2 more
      

      Apart from this, no errors are visible for that slave.

      A thread dump taken while the master tries to reconnect to the slave but nothing happens is available here:
      http://pastebin.com/DxFU8j7C
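
For readers who want to collect comparable data, the following is a minimal sketch of producing a thread dump from inside a JVM with the standard `java.lang.management` API, including a check for deadlocked threads. It is not how the dump linked above was taken (for a running Jenkins master an external tool such as `jstack` is the more usual route); `ThreadDumpSketch` and `dump` are illustrative names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns a textual dump of all live threads plus a deadlock count. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details the JVM says it can provide.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }

        // Both calls return null when no deadlock is found.
        long[] deadlocked = synchronizers
            ? threads.findDeadlockedThreads()
            : threads.findMonitorDeadlockedThreads();
        out.append("deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Threads that sit in the same frame across several dumps taken a few seconds apart are the ones worth looking at when a reconnect appears to hang.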

            Assignee: Unassigned
            Reporter: Johannes Wienke (languitar)
            Votes: 0
            Watchers: 2
