views: 65

answers: 3

We are streaming data in batches from a server (written in .NET, running on Windows) to a client (written in Java, running on Ubuntu). The data is in XML format. Occasionally the Java client throws an unexpected EOF while trying to decompress the stream. The message content always varies and is user-driven. The response from the client is also compressed using GZip; this never fails and seems to be rock solid. The response from the client is controlled by the system.

Is there a chance that some arrangement of characters, or some special characters, is creating false EOF markers? Could it be whitespace-related? Is GZip suitable for compressing XML?

I am assuming that the code to read and write from the input/output streams works, because we only occasionally get this exception, and when we inspect the user data at the time there seem to be special characters (which is why I asked the question), such as the '@' sign.

Any ideas?
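To the question of whether particular characters can corrupt a GZip stream: GZip operates on raw bytes and is agnostic to their content, which a simple round trip can demonstrate. This is a minimal, self-contained sketch (the class and method names are illustrative, not from the original code) that compresses and decompresses a string full of "special" characters:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Compress a string to gzip bytes, always encoding as UTF-8.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    // Decompress gzip bytes back to a string, decoding as UTF-8.
    static String decompress(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int r;
            while ((r = gzip.read(buf)) > 0) {
                out.write(buf, 0, r);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<msg user=\"a@b.com\">special chars: @ # &amp;</msg>";
        System.out.println(xml.equals(decompress(compress(xml)))); // prints "true"
    }
}
```

No character sequence inside the payload can produce a false EOF; the gzip format carries its own header, trailer, and CRC, so a premature EOF points at truncated delivery rather than the data itself.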

UPDATE: The actual code, as requested. I thought it wasn't this, because I had been to a couple of sites to get help on this issue and they all had more or less the same code. Some sites mentioned appended GZip streams. Something to do with GZip creating multiple segments?

public String receive() throws IOException {

    byte[] buffer = new byte[8192];
    ByteArrayOutputStream baos = new ByteArrayOutputStream(8192);

    do {
        int nrBytes = in.read(buffer);
        if (nrBytes > 0) {
            baos.write(buffer, 0, nrBytes);
        }
    } while (in.available() > 0);
    return compressor.decompress(baos.toByteArray());
}
public String decompress(byte[] data) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    ByteArrayInputStream in = new ByteArrayInputStream(data);

    try {
        GZIPInputStream inflater = new GZIPInputStream(in); 
        byte[] byteBuffer = new byte[8192];
        int r;
        while((r = inflater.read(byteBuffer)) > 0 ) {
            buffer.write(byteBuffer, 0, r); 
        }
    } catch (IOException e) {
        log.error("Could not decompress stream", e);
        throw e;
    }
    return new String(buffer.toByteArray());
}

At first I thought there must be something wrong with the way that I am reading in the stream, and that perhaps I was not looping properly. I then generated a ton of data to be streamed and checked that it was looping. Also, the fact that it happens so seldom and so far has not been reproducible led me to believe that it was the content rather than the scenario. But at this point I am totally baffled and for all I know it is the code.

Thanks again everyone.

Update 2:

As requested the .Net code:

Dim DataToCompress = Encoding.UTF8.GetBytes(Data)
Dim CompressedData = Compress(DataToCompress)

That gets the raw data into bytes. Then it gets compressed:

Private Function Compress(ByVal Data As Byte()) As Byte()
    Try
        Using MS = New MemoryStream()
            Using Compression = New GZipStream(MS, CompressionMode.Compress)
                Compression.Write(Data, 0, Data.Length)
                Compression.Flush()
                Compression.Close()
                Return MS.ToArray()
            End Using
        End Using
    Catch ex As Exception
        Log.Error("Error trying to compress data", ex)
        Throw
    End Try
End Function

Update 3: Also added more Java code. The `in` variable is the InputStream returned from socket.getInputStream().

+1  A: 

It certainly shouldn't be due to the data involved - the streams deal with binary data, so that shouldn't make any odds at all.

However, without seeing your code, it's hard to say for sure. My first port of call would be to check anywhere that you're using InputStream.read() - check that you're using the return value correctly, rather than assuming a single call to read() will fill the buffer.

If you could provide some code, that would help a lot...

Jon Skeet
Hi Jon, sure. Will do. I did double check this but perhaps it is incorrect after all. Thanks
uriDium
Hi Jon. The code is up. I added both the .Net compression part and the Java decompression.
uriDium
I think I have found my problem. Upon closer inspection of the available() method, it says it returns the number of bytes that can be read without blocking. I need to be sending a size indicator to the client and continue reading until all bytes have been read.
uriDium
@uriDium: Aargh, definitely don't use `available()` - I can't remember *ever* finding that useful!
Jon Skeet
Sounds like the most likely culprit.
Thorbjørn Ravn Andersen
...cont: you also have to send the length of the data so that the client knows how many bytes to receive. This leaves me with a burning question. Is it guaranteed that the client will always receive enough data in the first read to be able to get the first couple of bytes for the length indicator? It seems to me that nothing is really guaranteed with sockets.
uriDium
@uriDium: Yes - if you're exchanging messages over a persistent connection, you *either* need delimiters to mark "end of message" *or* you need to length-prefix each message. And no, you can't assume you'll get those bytes all in one call to `read()`. But that's relatively easy to work around, because you'll know when you're done.
Jon Skeet
@Jon: No one really gave me an answer, but the answer seems to have come from the guidance in these comments. That is why I will choose your answer, to give you some credit for the advice. Lastly, if we assume that the first 4 bytes (an Int32) contain the length of the rest of the payload, then we can call read() until we have those four bytes, reconstruct the int value, and call read() with the correct buffer size, etc.
uriDium
@uriDium: Exactly. Just check the results of calling `read()` each time to make sure the stream hasn't been closed abruptly for some reason :)
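The length-prefix scheme discussed in these comments can be sketched in a few lines of Java. This is an illustrative example, not the original project's code; it assumes a 4-byte big-endian length header. `DataInputStream.readFully` handles the "what if one read() doesn't return everything" concern by looping internally over short reads:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class LengthPrefixedFraming {

    // Writes a 4-byte big-endian length, then the payload.
    static void sendMessage(OutputStream out, byte[] payload) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(payload.length);
        dos.write(payload);
        dos.flush();
    }

    // Reads the 4-byte length, then blocks until exactly that many
    // bytes have arrived, however many read() calls that takes.
    static byte[] receiveMessage(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        int length = dis.readInt();     // throws EOFException if the peer closed early
        byte[] payload = new byte[length];
        dis.readFully(payload);         // loops internally over short reads
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the wire with in-memory streams.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        sendMessage(wire, "hello".getBytes("UTF-8"));
        byte[] got = receiveMessage(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(new String(got, "UTF-8")); // prints "hello"
    }
}
```

One caveat for the .NET side of this particular setup: DataInputStream expects big-endian, while .NET's BitConverter produces little-endian on x86, so the server would need to write the length in network byte order (e.g. via IPAddress.HostToNetworkOrder) for the two ends to agree.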
Jon Skeet
A: 

I would suspect that for some reason the data is altered along the way by being treated as text rather than binary, so it may be either \n conversions or a codepage alteration.

How is the gzipped stream transferred between the two systems?

Thorbjørn Ravn Andersen
I also thought it might have something to do with those types of conversions. When I compress, I convert the raw string data to bytes in UTF-8. I then compress the byte array and send it via a socket. It is picked up on the other side via a socket and decompressed using the above code snippet.
uriDium
"string raw data to bytes, in UTF8" - this sounds highly suspicious to me. Show the code. So the bytes are sent directly to a socket and back up? No web server or anything?
Thorbjørn Ravn Andersen
The GZipStream.Write method expects an array of bytes, correct? So I did byte[] bytes = Encoding.UTF8.GetBytes(rawString); I am not in front of the code right now; I will be able to post the code later. Yes, everything is sent directly. No web server or anything.
uriDium
Okay, posted the code that gets the bytes.
uriDium
A: 

It is not possible. EOF in TCP is delivered as an out-of-band FIN segment, not via the data.

EJP
That went totally over my head :) Do you have a source that I could read.
uriDium
It is the gzip stream that is corrupted, resulting in the decoder getting the EOF wrong.
Thorbjørn Ravn Andersen
@uriDium, he is talking about TCP/IP, not gzip.
Thorbjørn Ravn Andersen