JENKINS-23917

Protocol deadlock while uploading artifacts from ppc64


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Component/s: core, remoting

      I've encountered an ssh2 channel protocol issue when a ppc64 slave communicates with an x64 master.

      Most operations, like sending build logs, work fine. But when the time comes to upload artifacts at the end of the build, the build stalls indefinitely at:

      Archiving artifacts
      

      If I get stack dumps of slave and master using jstack, I see the master waiting to read from the slave:

      "Channel reader thread: Fedora16-ppc64-Power7-osuosl-karman" prio=10 tid=0x00000000038c2800 nid=0x6de7 in Object.wait() [0x00007f825ef8b000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at java.lang.Object.wait(Object.java:502)
              at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
              at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      and the slave is waiting for data from the master:

      "Channel reader thread: channel" prio=10 tid=0x00000fff940fedd0 nid=0x558e runnable [0x00000fff6dc6d000]
         java.lang.Thread.State: RUNNABLE
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:236)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
              - locked <0x00000fff78ba9f98> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Of course, I can't take both dumps at exactly the same moment (not that this would be meaningful given network latency and buffering), but repeated runs never show any other state for either thread.
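The repeated jstack runs could also be approximated from inside a JVM. The following is only a sketch under assumptions: `ReaderThreadSampler` and its methods are hypothetical names, not part of Jenkins or remoting, and the code would have to run inside the master or slave JVM being inspected. It samples the state of every thread whose name starts with a given prefix and reports whether that state ever changes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Jenkins; must run inside the JVM being inspected.
final class ReaderThreadSampler {
    // Return name -> state for every live thread whose name starts with the prefix.
    static Map<String, Thread.State> sample(String namePrefix) {
        Map<String, Thread.State> states = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                states.put(t.getName(), t.getState());
            }
        }
        return states;
    }

    // Take several samples and report whether every sample matched the first one.
    static boolean stateNeverChanges(String namePrefix, int samples, long intervalMillis)
            throws InterruptedException {
        Map<String, Thread.State> first = sample(namePrefix);
        for (int i = 1; i < samples; i++) {
            Thread.sleep(intervalMillis);
            if (!sample(namePrefix).equals(first)) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `ReaderThreadSampler.stateNeverChanges("Channel reader thread", 20, 500)` on each side would give a slightly stronger version of the "never show any other state" observation, though a state that changes only between samples would still be missed.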

      tshark shows that there's some SSH chatter going on:

        0.000000 SLAVE -> MASTER SSH 126 Encrypted response packet len=60
        0.176121 MASTER -> SLAVE SSH 94 Encrypted request packet len=28
        0.176151 SLAVE -> MASTER TCP 66 ssh > 37501 [ACK] Seq=61 Ack=29 Win=707 Len=0 TSval=4141397874 TSecr=2808266826
      

      but it could well be low-level SSH keepalives or similar, since it occurs at precise 5-second intervals with nothing much else happening. There are three master->slave SSH connections, so there's no guarantee that this traffic even belongs to the connection with the stuck channel.

      My first thought is endianness.

      I don't really know how to begin debugging this issue, though.
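One possible starting point for the endianness hypothesis, offered as a sketch and not as a diagnosis: Java code that packs integers with explicit shifts (or with `DataOutputStream`, which always writes big-endian) produces the same bytes on every CPU, so only code that depends on native byte order could behave differently between a big-endian ppc64 host and a little-endian x64 host. The class `FrameHeaderDemo` and its 16-bit length header are hypothetical and are not taken from remoting or trilead; the stack traces above do not show which style either library uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; FrameHeaderDemo is not part of Jenkins remoting or trilead.
final class FrameHeaderDemo {
    // Encode a 16-bit length with explicit shifts: byte order is fixed by the code, not the CPU.
    static byte[] encodeExplicit(int length) {
        return new byte[] { (byte) (length >>> 8), (byte) length };
    }

    // Decode the same two bytes, again independent of the host's native order.
    static int decodeExplicit(byte[] header) {
        return ((header[0] & 0xFF) << 8) | (header[1] & 0xFF);
    }

    // Decode by interpreting the bytes in a given order, as native-order code would.
    static int decodeWithOrder(byte[] header, ByteOrder order) {
        return ByteBuffer.wrap(header).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] header = encodeExplicit(0x0102);
        System.out.println("native order:  " + ByteOrder.nativeOrder());
        System.out.println("explicit:      " + decodeExplicit(header));
        System.out.println("big-endian:    " + decodeWithOrder(header, ByteOrder.BIG_ENDIAN));
        System.out.println("little-endian: " + decodeWithOrder(header, ByteOrder.LITTLE_ENDIAN));
    }
}
```

If a reader interpreted a length header in the wrong order, it would wait for a different number of bytes than the sender wrote, which would look much like the stall described here. Running the `main` method on both hosts and comparing the `native order` line would at least confirm that the two JVMs really do differ.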

      Attachments:

        1. config.xml (1.0 kB)
        2. jenkins-master-idle-stack.txt (47 kB)
        3. jenkins-master-stack.txt (49 kB)
        4. jenkins-slave-stack.txt (6 kB)
        5. slavelog-from-master.txt (4 kB)
        6. slavelog-from-slave.txt (1 kB)

            Assignee: Unassigned
            Reporter: Craig Ringer (ringerc)
            Votes: 2
            Watchers: 5
