ansaurus

Question

Answer 1

+2 A:

Have a look at this link (C code example).

/* zran.c -- example of zlib/gzip stream indexing and random access
...

Gzip is just zlib with an envelope.

ChristopheD 2010-03-26 21:46:51

Thanks, that's cool! If only I found a way to use that comfortably from Java..

jkff 2010-03-26 22:04:03

@jkff: If you don't need cross-platform deployment, check out JNA. It's surprisingly easy to use as a way to call C libraries.

Rex Kerr 2010-03-27 01:23:36

Thanks again, I did so and it works like a charm! Rex, thanks to you, too: I used JNA :)

jkff 2010-03-27 18:41:35

Whoohoo, now my unlimited-size-log viewer supports gzip'd logs!

jkff 2010-03-27 19:16:27

Answer 2

A:

interesting question. I don't understand why your 2nd option (recompress file in chunks) would double the disk space. Seems to me it would be the same, less a small amount of overhead. If you have control over the compression piece, then that seems like the right idea.

Maybe what you mean is that you don't have control over the input, and therefore it would double.

If you can do it, I'm imagining modelling it as a CompressedFileStream class that uses as its backing store, a series of 1mb gzip'd blobs. When reading, a Seek() on the stream would move to the appropriate blob and decompress. A Read() past the end of a blob would cause the stream to open the next blob.

ps: GZIP is described in IETF RFC 1952, but it uses DEFLATE for the compression format. There'd be no reason to use the GZIP elaboration if you implemented this CompressedFileStream class as I've imagined it.

Cheeso 2010-03-26 21:51:36

I don't like the 2nd option because I'm not going to delete the original files, and I don't have control over how they are generated. However, for now that's how I actually implemented the stuff (quite as you described), but I was not satisfied with that and that's why I asked the question :)

jkff 2010-03-26 22:00:40

Answer 3

+1 A:

The BGZF file format, compatible with GZIP was developped by the biologists.

(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.

In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java

Pierre 2010-04-22 10:02:41

Thanks, that's nice, but I need my tool to be immediately applicable to existing logfiles, and they're usually archived in .zip or .gzip by a third-party archiver. Besides, I've already got a working solution :)

jkff 2010-04-23 09:04:41

ansaurus

tags:

views:

answers:

Random access gzip stream

related questions