Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-23917

Protocol deadlock while uploading artifacts from ppc64

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core, remoting
    • Environment:
    • Similar Issues:

      Description

      I've encountered an ssh2 channel protocol issue when a ppc64 slave communicates with an x64 master.

      Most operations, like sending build logs, work fine. When the time comes to upload artifacts at the end of the build the build stalls indefinitely at:

      Archiving artifacts
      

      If I get stack dumps of slave and master using jstack, I see the master waiting to read from the slave:

      "Channel reader thread: Fedora16-ppc64-Power7-osuosl-karman" prio=10 tid=0x00000000038c2800 nid=0x6de7 in Object.wait() [0x00007f825ef8b000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at java.lang.Object.wait(Object.java:502)
              at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
              at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      and the slave is waiting for data from the master:

      "Channel reader thread: channel" prio=10 tid=0x00000fff940fedd0 nid=0x558e runnable [0x00000fff6dc6d000]
         java.lang.Thread.State: RUNNABLE
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:236)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
              - locked <0x00000fff78ba9f98> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      of course I can't get those dumps at exactly the same moment, even if that were meaningful with network latencies and buffering, but repeated runs never show any other state for either thread.

      tshark shows that there's some SSH chatter going on:

        0.000000 SLAVE -> MASTER SSH 126 Encrypted response packet len=60
        0.176121 MASTER -> SLAVE SSH 94 Encrypted request packet len=28
        0.176151 SLAVE -> MASTER TCP 66 ssh > 37501 [ACK] Seq=61 Ack=29 Win=707 Len=0 TSval=4141397874 TSecr=2808266826
      

      but it should well be low level ssh keepalives or similar, as it's at precise 5 second intervals with nothing much else happening. There are three master->slave ssh connections, so it's not guaranteed that it's even the one associated with the stuck channel.

      My first thought is endianness.

      I don't really know how to begin debugging this issue, though.

        Attachments

        1. config.xml
          1.0 kB
        2. jenkins-master-idle-stack.txt
          47 kB
        3. jenkins-master-stack.txt
          49 kB
        4. jenkins-slave-stack.txt
          6 kB
        5. slavelog-from-master.txt
          4 kB
        6. slavelog-from-slave.txt
          1 kB

          Activity

          Hide
          ringerc Craig Ringer added a comment -

          With -text, log shown for the node on the master when it dies:

          <===[JENKINS REMOTING CAPACITY]===><===[HUDSON TRANSMISSION BEGINS]===channel started
          Slave.jar version: 2.43
          This is a Unix slave
          Slave successfully connected and online
          Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.io.StreamCorruptedException: invalid stream header: AC64736F
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          channel stopped
          ERROR: Connection terminated
          java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          [07/23/14 05:57:49] [SSH] Connection closed.
          ERROR: [07/23/14 05:57:49] slave agent was terminated
          java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          

          and from the slave its self:

          channel startedchannel started
          
          Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.io.StreamCorruptedException: invalid stream header: AC64736F
                  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800)
                  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297)
                  at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.io.StreamCorruptedException: invalid stream header: AC64736F
                  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800)
                  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297)
                  at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          channel stoppedchannel stopped
          
          Show
          ringerc Craig Ringer added a comment - With -text, log shown for the node on the master when it dies: <===[JENKINS REMOTING CAPACITY]===><===[HUDSON TRANSMISSION BEGINS]===channel started Slave.jar version: 2.43 This is a Unix slave Slave successfully connected and online Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel channel java.io.StreamCorruptedException: invalid stream header: AC64736F at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) channel stopped ERROR: Connection terminated java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50) Caused by: java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) [07/23/14 05:57:49] [SSH] Connection closed. ERROR: [07/23/14 05:57:49] slave agent was terminated java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50) Caused by: java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) and from the slave its self: channel startedchannel started Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel channel java.io.StreamCorruptedException: invalid stream header: AC64736F at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) Jul 23, 2014 5:57:48 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel channel java.io.StreamCorruptedException: invalid stream header: AC64736F at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:800) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:297) at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) channel stoppedchannel stopped
          Hide
          ringerc Craig Ringer added a comment -

          I can't for the life of me figure out how to use the -tcp option to make the slave passively accept a tcp connection from the master.

          In any case, -text still fails, so it seems unlikely to be an issue with SSH mangling binary.

          Show
          ringerc Craig Ringer added a comment - I can't for the life of me figure out how to use the -tcp option to make the slave passively accept a tcp connection from the master. In any case, -text still fails, so it seems unlikely to be an issue with SSH mangling binary.
          Hide
          ringerc Craig Ringer added a comment -

          When I switch the node to JNLP (so it uses direct TCP/IP as a transport, rather than SSH) I can no longer reproduce this despite repeated test runs.

          So, so far:

          • Only observed on ppc64
          • Only observed for ssh slaves
          • Using -text protocol does not help
          Show
          ringerc Craig Ringer added a comment - When I switch the node to JNLP (so it uses direct TCP/IP as a transport, rather than SSH) I can no longer reproduce this despite repeated test runs. So, so far: Only observed on ppc64 Only observed for ssh slaves Using -text protocol does not help
          Hide
          ringerc Craig Ringer added a comment -

          I've left a JNLP worker running for some hours, running the same job with 1mb and 10mb artifact sizes that caused intermittent problems over the ssh transport. No problems.

          Show
          ringerc Craig Ringer added a comment - I've left a JNLP worker running for some hours, running the same job with 1mb and 10mb artifact sizes that caused intermittent problems over the ssh transport. No problems.
          Hide
          jimis jimis jimis added a comment -

          Hi, I'm experiencing the same issue using Jenkins 1.580.1 on ppc64 running AIX 5.3. In particular I'm seeing the exact same bytes that you have posted in the attached file "slavelog-from-slave.log":

          java.io.StreamCorruptedException: invalid stream header: 009AACED

          Searching the web I found the following explanation for this sequence of bytes:

          Object stream data is preceded by a 4 byte 'magical' sequence AC ED 00 05. An ObjectInputStream will peek for this data at construction time rather than before the first read. And that's logical: one wants to be sure it is a proper stream before being too far in an application. The sequence is buffered by the ObjectOutputStream at construction time so that it is pushed on the stream at the first write. This method often leads to complexities in buffered situations or transferring via pipes or sockets. Fortunately there is a just as simple as effective solution to all these problems: Flush the ObjectOutputStream immediately after contruction!

          Looking at the similarity in the byte sequence, it looks like either an endianess issue or an off-by-one error.

          Thank you for providing the workaround, I'll set up the buildslave using JNLP.

          Show
          jimis jimis jimis added a comment - Hi, I'm experiencing the same issue using Jenkins 1.580.1 on ppc64 running AIX 5.3. In particular I'm seeing the exact same bytes that you have posted in the attached file "slavelog-from-slave.log": java.io.StreamCorruptedException: invalid stream header: 009AACED Searching the web I found the following explanation for this sequence of bytes: Object stream data is preceded by a 4 byte 'magical' sequence AC ED 00 05. An ObjectInputStream will peek for this data at construction time rather than before the first read. And that's logical: one wants to be sure it is a proper stream before being too far in an application. The sequence is buffered by the ObjectOutputStream at construction time so that it is pushed on the stream at the first write. This method often leads to complexities in buffered situations or transferring via pipes or sockets. Fortunately there is a just as simple as effective solution to all these problems: Flush the ObjectOutputStream immediately after contruction! Looking at the similarity in the byte sequence, it looks like either an endianess issue or an off-by-one error. Thank you for providing the workaround, I'll set up the buildslave using JNLP.

            People

            • Assignee:
              Unassigned
              Reporter:
              ringerc Craig Ringer
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: