Say I have a huge bzip2 file (over 5 GB) and I want to decompress only block #x, because that is where my data is (the block is different every time). How would I do this?

I have thought about building an index of where all the blocks are, then cutting the block I need out of the file and applying bzip2recover to it.

I have also thought about compressing, say, 1 MB at a time, appending each chunk to a file (and recording its location), and simply grabbing the chunk when I need it, but I'd rather keep the original bzip2 file intact.

My preferred language is Ruby, but a solution in any language is fine by me (as long as I understand the principle).

So, does anybody have any ideas?

+2  A: 

There is seek-bzip2: http://bitbucket.org/james_taylor/seek-bzip2

Grab the source and compile it.

Run with

./seek-bzip2  32 < bzip_compressed.bz2 

to test.

The only parameter is the bit offset of the header of the block you want. You can find it by searching the binary file for the hex string "31 41 59 26 53 59".
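Since bzip2 packs blocks as a bit stream, that marker may start at any bit inside a byte, so a plain byte-wise search can miss it. Here is a rough Python sketch of a bit-level search. The function name `find_magic_bit_offsets` is made up for illustration and is not part of seek-bzip2, and a match could in principle be a coincidence inside compressed data, so treat the offsets as candidates.

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of every bzip2 block
EOS_MAGIC = 0x177245385090    # 48-bit marker that closes the stream


def find_magic_bit_offsets(data, magic=BLOCK_MAGIC):
    """Return the sorted bit offsets at which the 48-bit `magic` occurs in `data`.

    Illustrative helper (hypothetical name). It checks all 8 possible
    alignments of the marker relative to byte boundaries.
    """
    offsets = []
    total_bits = len(data) * 8
    for shift in range(8):
        # Place the marker `shift` bits into a 7-byte (56-bit) window.
        pattern = magic << (8 - shift)
        mask = ((1 << 48) - 1) << (8 - shift)
        # Bytes 1..5 of the window are fully determined for every shift,
        # so search for those and verify the partial edge bytes afterwards.
        core = pattern.to_bytes(7, "big")[1:6]
        pos = data.find(core, 1)
        while pos != -1:
            start = pos - 1
            bit_offset = start * 8 + shift
            window = int.from_bytes(data[start:start + 7].ljust(7, b"\0"), "big")
            if bit_offset + 48 <= total_bits and (window & mask) == pattern:
                offsets.append(bit_offset)
            pos = data.find(core, pos + 1)
    return sorted(offsets)
```

This takes the whole input as one bytes object, which is not practical for a 5 GB file as is; there you would have to feed it overlapping chunks (or a memory-mapped file) and add the chunk's starting bit to each result.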

32 is the size in bits of the "BZh1" stream header, so the first block starts at bit offset 32. The 1 can be any digit from 1 to 9; it gives the block size in units of 100 kB.
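If you already know where a block starts and where the next one starts (in bits), another option is to wrap that block in a tiny stream of its own and hand it to any ordinary bzip2 decoder. As far as I understand, this is roughly what bzip2recover does. A minimal Python sketch follows; `extract_block` is a made-up name, and the level digit is set to 9 on the assumption that a decoder prepared for the largest block size can also handle smaller blocks.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of every block
EOS_MAGIC = 0x177245385090    # 48-bit end-of-stream marker


def extract_block(data, start_bit, end_bit, level=9):
    """Decompress the single block occupying bits [start_bit, end_bit) of `data`.

    Illustrative helper (hypothetical name). `end_bit` is the bit offset of
    the next block marker, or of the end-of-stream marker for the last block.
    """
    first_byte = start_bit // 8
    last_byte = (end_bit + 7) // 8
    chunk = int.from_bytes(data[first_byte:last_byte], "big")
    chunk_bits = (last_byte - first_byte) * 8
    nbits = end_bit - start_bit
    # Drop the bits after end_bit, then mask off the bits before start_bit.
    trailing = chunk_bits - (start_bit % 8) - nbits
    block = (chunk >> trailing) & ((1 << nbits) - 1)
    if nbits < 80 or block >> (nbits - 48) != BLOCK_MAGIC:
        raise ValueError("no block marker at start_bit")
    # The 32 bits right after the marker are the block's CRC.
    block_crc = (block >> (nbits - 80)) & 0xFFFFFFFF

    # New stream: "BZh<level>" + block bits + end marker + stream CRC.
    # With a single block, the stream CRC equals the block CRC.
    stream = int.from_bytes(b"BZh" + str(level).encode("ascii"), "big")
    stream = (stream << nbits) | block
    stream = (stream << 48) | EOS_MAGIC
    stream = (stream << 32) | block_crc
    total_bits = 32 + nbits + 48 + 32
    pad = -total_bits % 8  # pad with zero bits up to a whole byte
    return bz2.decompress((stream << pad).to_bytes((total_bits + pad) // 8, "big"))
```

For a stream that contains only one block, `extract_block(data, 32, end_bit)` with `end_bit` pointing at the end-of-stream marker should give the same result as decompressing the whole file.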

osgx
Note: a block start is not necessarily on a byte boundary :( seek-bzip2 includes a bzip-table program that lists the bit offset of each block and the size of its uncompressed data.
osgx
Unfortunately, bzip-table is almost as slow as actual decompression :( It does almost a full decompression cycle, but doesn't check the CRC.
osgx