I have a smallish number (2-10) of largeish files (6-15GB) that compress really well (4:1).

I'm writing the client and server in Java, and I want to send the files from the client to the server so that:

1. the client compresses the files as they are sent (i.e. no intermediate .zip file is created)
2. the compressed stuff ends up on the server as a well-formed file (e.g. a .zip or .tgz file) so that it can be downloaded 'as is'
3. the transfer can be resumed if it fails mid-way through
4. the resumed transfer might happen in a new process altogether

The first two can be achieved pretty easily using java.io sockets and java.util.zip ZipOutputStreams. The third is the one causing me grief. The fourth is really just context.
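For what it's worth, here is a minimal sketch of how the first two points fall out of java.io and java.util.zip: wrap the socket's output stream in a ZipOutputStream and stream each file into it. The class name and the host/port/file arguments are just placeholders, and this does nothing at all for resumability:

    import java.io.*;
    import java.net.Socket;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class CompressingSender {
        // Streams each file straight into a zip stream wrapped around the socket,
        // so no intermediate .zip file is ever written on the client.
        public static void send(String host, int port, File... files) throws IOException {
            try (Socket socket = new Socket(host, port);
                 ZipOutputStream zip = new ZipOutputStream(
                         new BufferedOutputStream(socket.getOutputStream()))) {
                byte[] buf = new byte[64 * 1024];
                for (File f : files) {
                    zip.putNextEntry(new ZipEntry(f.getName()));
                    try (InputStream in = new FileInputStream(f)) {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            zip.write(buf, 0, n);
                        }
                    }
                    zip.closeEntry();
                }
                zip.finish(); // the server just copies the socket's bytes to disk as-is
            }
        }
    }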

I'm guessing a solution will possibly require some sort of partial re-transmit or re-parse to establish a dictionary, or something.

Are there any Java libraries that support resumable compression?

A: 

I don't know of anything that will allow you to resume compression in the middle of a stream; that seems like a very state-sensitive thing.

Instead, you might consider "breaking" the file up into smaller chunks and sending those individually (with compression), say 100kb chunks. You still can't resume in the middle of a chunk, but you can easily restart at the beginning of the most recent chunk.
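A rough sketch of that idea, using the ~100kb chunk size mentioned above and a hypothetical sendChunk() transport call; each chunk is compressed as an independent deflate stream, so a restart can begin at the first chunk the server hasn't confirmed:

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class ChunkedSender {
        private static final int CHUNK_SIZE = 100 * 1024; // the ~100kb suggested above

        // Compresses each chunk as its own independent deflate stream, so a failed
        // transfer can restart at the first chunk the server hasn't confirmed.
        // sendChunk() is a placeholder for whatever transport is in use.
        public static void sendFrom(File file, long firstMissingChunk) throws IOException {
            byte[] chunk = new byte[CHUNK_SIZE];
            Deflater deflater = new Deflater();
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(firstMissingChunk * CHUNK_SIZE);
                long index = firstMissingChunk;
                int n;
                while ((n = raf.read(chunk)) != -1) {
                    deflater.reset(); // fresh stream per chunk: no state shared across chunks
                    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                    try (DeflaterOutputStream out = new DeflaterOutputStream(compressed, deflater)) {
                        out.write(chunk, 0, n);
                    }
                    sendChunk(index++, compressed.toByteArray());
                }
            } finally {
                deflater.end();
            }
        }

        private static void sendChunk(long index, byte[] compressed) {
            // transport-specific; omitted in this sketch
        }
    }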

Mike Caron
A: 

Compressing on the fly is easy. The problem you're going to have is resuming the upload. That basically eliminates HTTP as a transport, so you'll need to look at something like (S)FTP or SCP. Even there, the problem is that you're not creating a file on the client, so what exactly is going to be resumed? At the very least you will need a compression method that is deterministic (meaning that, given a specified file, any two runs of the compression algorithm produce exactly the same output). If that isn't true, you can't resume at all.

My advice is to take a slightly tangential approach. Divide the file into manageable chunks (say 50MB); that is deterministic. Compress each chunk individually. If a chunk fails, resend it. There is no true resumption, but you can get partial uploads by having the server tell the client which chunks it has received or is waiting for.

One problem you'll have is identifying a particular file. Will the file name do? Is there some other identifying characteristic? If two clients try to upload the same file, will the server be able to detect this? The standard approach for this kind of thing is to use a checksum (an SHA1 hash of the file contents), but you don't want to read a 16GB file in its entirety just to do a checksum, so some other method would be preferable.

Imagine the network communication goes something like this:

Client: SEND file1234 CHUNKS 167
Server: RECEIVED (already got) or WAIT 7 (chunk #)
Client: compress and send chunk 7
Server: WAIT 8
....
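As a loose illustration of the server's side of that exchange (the class, the method names, and the BitSet bookkeeping here are hypothetical, and message parsing/formatting is left out):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Tracks which chunks of each upload have arrived and answers the
    // SEND/RECEIVED/WAIT exchange sketched above.
    public class ChunkTracker {
        private final Map<String, BitSet> received = new HashMap<>();
        private final Map<String, Integer> totalChunks = new HashMap<>();

        public String onSend(String fileId, int chunks) {
            totalChunks.put(fileId, chunks);
            received.putIfAbsent(fileId, new BitSet(chunks));
            return nextReply(fileId);
        }

        public String onChunk(String fileId, int chunkIndex) {
            received.get(fileId).set(chunkIndex);
            return nextReply(fileId);
        }

        private String nextReply(String fileId) {
            BitSet got = received.get(fileId);
            int next = got.nextClearBit(0);
            return next >= totalChunks.get(fileId) ? "RECEIVED" : "WAIT " + next;
        }
    }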

This method will also handle multiple clients uploading the same file at the same time, since the server can ask different clients for different chunks and merge them together.

The one problem with this method is that the file isn't "complete" on the server (as a zip or tarball), but I think you have to give that up to end up with something that will actually work and isn't a nightmare to code.

cletus
It is safe to assume that a file can be identified by its name (it's actually a little more complicated than that, but the negotiation for an ID is handled elsewhere in the system). (Also, please see the comment on the question itself as a reply to the chunking idea)
Burleigh Bear
+1  A: 

I couldn't find any pre-fabbed libraries that support resumable compression in the manner I required. There are, however, plenty of bits and pieces available under open licenses with which to write your own. I've now got a client/server solution that satisfies all the constraints outlined in the question.

The idea is similar to the chunking ideas outlined above, but the server manages the chunking and does some bookkeeping that maps compressed chunks on the client to compressed chunks on the server. There are no temp files anywhere in the solution. The basic protocol is as follows (a rough sketch of the client's side of the loop appears after the steps):

(1) The client sends a manifest to the server, containing the
    intended contents of the zip file.
(2) The server sends back an ID for the manifest.
    Then, repeatedly:
  (3) The client asks the server "is there anything still
      required for the manifest with ID X?"
  (4) The server replies "no", or with a manifest entry
      for a file in the manifest, plus an offset and length to send.
  (5) The client compresses that chunk and sends it (plus some
      bookkeeping info).
  (6) The server appends the chunk to the ever-growing zip file,
      plus the appropriate zip file crud. If the server orders
      the chunks it asks the client for appropriately, this can
      all be done with file appends.
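As promised above, here is a rough sketch of the client's side of steps 3-5. ServerConnection, ChunkRequest and ChunkCompressor are hypothetical stand-ins for the real transport, the server's reply, and a chunk-wise deflater like the JZLib-based one described below:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ResumableUploader {

        // Hypothetical stand-in for the server's reply in step 4.
        public static final class ChunkRequest {
            public final String path;   // which file in the manifest
            public final long offset;   // where to start reading
            public final int length;    // how many bytes to send
            public ChunkRequest(String path, long offset, int length) {
                this.path = path; this.offset = offset; this.length = length;
            }
        }

        // Hypothetical transport: steps 3/4 (null means "no, nothing required") and step 5.
        public interface ServerConnection {
            ChunkRequest nextRequired(long manifestId) throws IOException;
            void sendChunk(long manifestId, ChunkRequest req, byte[] compressed) throws IOException;
        }

        // Hypothetical chunk-wise deflater, e.g. the JZLib full-flush sketch below.
        public interface ChunkCompressor {
            byte[] deflateChunk(byte[] raw);
        }

        // Repeats steps 3-5 until the server says nothing more is required.
        public static void upload(ServerConnection server, ChunkCompressor compressor, long manifestId)
                throws IOException {
            ChunkRequest req;
            while ((req = server.nextRequired(manifestId)) != null) {
                byte[] raw = new byte[req.length];
                try (RandomAccessFile f = new RandomAccessFile(req.path, "r")) {
                    f.seek(req.offset);
                    f.readFully(raw);
                }
                server.sendChunk(manifestId, req, compressor.deflateChunk(raw));
            }
        }
    }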

The server only updates the manifest bookkeeping once step 6 completes successfully, so the transfer can be resumed safely (well, more or less) after a failure during steps 3-6, including a crash of the server or the client.

There were a few bits that were a little fiddly in doing chunk-wise zip file creation. The basic thing that needs to be achieved is a compression algorithm that can be applied chunk by chunk; deflate can be used in this manner, because it can be flushed at arbitrary points.

The Java ZipOutputStream and DeflaterOutputStream are not really suitable for 'chunk-wise' deflation/zipping, since they don't allow arbitrary flushing. There is a BSD-style licensed Java implementation of ZLib at http://www.jcraft.com/jzlib. I haven't benchmarked it for speed, but it gives the same output as the Java implementation. JZLib is great, and it supports all the flushing modes of ZLib (unlike the java.util.zip.Deflater implementation).
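As an illustration, here is a minimal sketch of compressing a single chunk with JZLib's zlib-style ZStream API and a full flush. The class is mine, buffer sizing and error handling are simplified, and for the raw deflate data that goes inside a zip entry the deflater would need to be initialised without the zlib wrapper:

    import com.jcraft.jzlib.JZlib;
    import com.jcraft.jzlib.ZStream;

    // Compresses one chunk and ends it on a full flush, so the next chunk can be
    // compressed later (even by a different process) without the previous
    // compressor's state.
    public class ChunkDeflater {
        public static byte[] deflateChunk(byte[] chunk) {
            ZStream z = new ZStream();
            z.deflateInit(JZlib.Z_DEFAULT_COMPRESSION);

            // generous allowance for incompressible data
            byte[] out = new byte[chunk.length + chunk.length / 10 + 64];
            z.next_in = chunk;
            z.next_in_index = 0;
            z.avail_in = chunk.length;
            z.next_out = out;
            z.next_out_index = 0;
            z.avail_out = out.length;

            int err = z.deflate(JZlib.Z_FULL_FLUSH); // flush and reset the dictionary
            if (err != JZlib.Z_OK || z.avail_in != 0) {
                throw new RuntimeException("deflate failed: " + err);
            }
            int produced = out.length - z.avail_out;
            z.deflateEnd();

            byte[] result = new byte[produced];
            System.arraycopy(out, 0, result, 0, produced);
            return result;
        }
    }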

Also, zip files contain a CRC for each entry. So the manifest entry in step 4 contains a 'partial' CRC, which is updated for each chunk and sent back in the bookkeeping info in step 5. There is a public domain implementation of CRC32 for Java at http://www.axlradius.com/freestuff/CRC32.java. I've benchmarked it and it is as fast as the native Java implementation and produces equivalent CRCs.
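To make the per-chunk CRC bookkeeping concrete, here is a small self-contained illustration. It uses java.util.zip.CRC32 purely for demonstration: that class won't let you restore a saved intermediate value, which is exactly what resuming in a new process (constraint 4 in the question) needs, and why a standalone CRC32 class is used instead:

    import java.util.zip.CRC32;

    // Shows the incremental idea: update the entry's CRC a chunk at a time and
    // treat the running value as the 'partial' CRC stored in the manifest.
    public class PartialCrcDemo {
        public static void main(String[] args) {
            byte[] chunkA = "first chunk".getBytes();
            byte[] chunkB = "second chunk".getBytes();

            CRC32 crc = new CRC32();
            crc.update(chunkA, 0, chunkA.length);
            long partial = crc.getValue(); // this is what would go into the manifest
            System.out.println("partial CRC after chunk A: " + Long.toHexString(partial));

            crc.update(chunkB, 0, chunkB.length);
            System.out.println("final CRC: " + Long.toHexString(crc.getValue()));
        }
    }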

Finally, the zip file format is quite pernickety. I managed to piece together most of an implementation from the Wikipedia page and http://www.pkware.com/documents/casestudies/APPNOTE.TXT, although at one point I couldn't work out the right value for one of the fields. Luckily the JDK's ZipOutputStream source is available, so you can see what they do.
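For anyone attempting the same thing, here is a sketch of the local file header part, following the field layout in APPNOTE.TXT. One standard way to cope with the CRC and sizes not being known up front is to set general-purpose bit 3 and put the real values in a data descriptor after the entry's data; that is not necessarily what my solution does, the helper names are mine, and the data descriptor and central directory are omitted:

    import java.io.IOException;
    import java.io.OutputStream;

    // Writes a zip local file header in streaming mode: bit 3 is set, so the
    // CRC and sizes are zero here and the real values follow the data in a
    // data descriptor. Dates/times are left as zero for brevity.
    public class LocalFileHeader {
        public static void write(OutputStream out, String fileName) throws IOException {
            // (strictly, non-ASCII names also need the UTF-8 flag, bit 11)
            byte[] name = fileName.getBytes("UTF-8");
            writeInt(out, 0x04034b50);    // local file header signature ("PK\3\4")
            writeShort(out, 20);          // version needed to extract (2.0 for deflate)
            writeShort(out, 1 << 3);      // general purpose flags: bit 3 = sizes/CRC follow the data
            writeShort(out, 8);           // compression method 8 = deflate
            writeShort(out, 0);           // last mod time (omitted in this sketch)
            writeShort(out, 0);           // last mod date (omitted in this sketch)
            writeInt(out, 0);             // crc-32: zero here, real value in the data descriptor
            writeInt(out, 0);             // compressed size: likewise
            writeInt(out, 0);             // uncompressed size: likewise
            writeShort(out, name.length); // file name length
            writeShort(out, 0);           // extra field length
            out.write(name);
        }

        // The zip format stores all multi-byte fields little-endian.
        private static void writeShort(OutputStream out, int v) throws IOException {
            out.write(v & 0xff);
            out.write((v >>> 8) & 0xff);
        }

        private static void writeInt(OutputStream out, long v) throws IOException {
            out.write((int) (v & 0xff));
            out.write((int) ((v >>> 8) & 0xff));
            out.write((int) ((v >>> 16) & 0xff));
            out.write((int) ((v >>> 24) & 0xff));
        }
    }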

Burleigh Bear