I have about 200,000 text files stored in a single bz2 file. When I scan the bz2 file to extract the data I need, it is extremely slow: it has to read through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?

Also, I thought about organizing the files in the tar.bz2 so that the program knows where to look. Is there any way to organize the files that are put into a bz2?

More Info/Edit: I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and compresses just as well?

+6  A: 

Do you have to use bzip2? Reading its documentation, it's quite clear that it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, but might compress worse, of course.
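
As a rough illustration of what random access in a Zip archive looks like, here is a minimal sketch using Python's standard `zipfile` module. The function names (`build_zip`, `read_member`) and the file names are hypothetical, chosen only for this example. Python's `zipfile` enables the ZIP64 extensions by default, which is what allows an archive to hold more than 65,535 entries; whether every other tool you use can read such archives is an assumption you would need to check.

```python
import zipfile


def build_zip(zip_path, files):
    """Write each (name -> text) pair as its own deflate-compressed member."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            # writestr() encodes str data as UTF-8 before compressing it.
            zf.writestr(name, text)


def read_member(zip_path, name):
    """Read one member by name; only that member is decompressed."""
    with zipfile.ZipFile(zip_path) as zf:
        # The central directory at the end of the archive locates the member.
        return zf.read(name).decode("utf-8")
```

Because each member is compressed on its own, the compression ratio is usually worse than compressing everything as one bzip2 stream, which is the trade-off mentioned above.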

unwind
Yes, I was about to write the same comment.
Eike
7zip is another alternative that compresses even better than bzip2 and has Python bindings. I don't know how well it supports random access, though.
Gilles
7zip is just a container for bzip2 or LZMA -- I think it tries several algorithms and uses the one with the best results.
katrielalex
Oh, I see what you mean. I had actually written this program first using zip compression, but I ran into an issue with the number of files. I couldn't get around the fact that you can only have 64k files in a zip file. I need room for 200k files.
xZel
@xZel - I don't know of any file archiver other than zip that supports random access in the way you want. Why not just keep the files uncompressed on the filesystem?
Omnifarious
A: 

Bzip2 compresses in large blocks (900 kB by default, I believe). One method that would speed up scanning the tar file dramatically, at the cost of a worse compression ratio, would be to compress each file individually and then tar the results together. This is essentially how Zip-format files work (though they use zlib's deflate compression rather than bzip2). You could then cheaply walk the tar headers to locate a member and only have to decompress the specific file(s) you are looking for.
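
A minimal sketch of that approach, using Python's standard `tarfile` and `bz2` modules, might look like the following. The helper names (`build_archive`, `build_index`, `read_one`) and the `.bz2` member suffix are assumptions made for this example, not part of any existing tool. The outer tar is left uncompressed, so a member's byte offset can be looked up once and then read directly.

```python
import bz2
import io
import tarfile


def build_archive(archive_path, files):
    """Write an uncompressed tar whose members are individually bz2-compressed."""
    with tarfile.open(archive_path, "w:") as tar:
        for name, text in files.items():
            payload = bz2.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name + ".bz2")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def build_index(archive_path):
    """Scan the tar headers once; map member name -> (data offset, size)."""
    with tarfile.open(archive_path, "r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar}


def read_one(archive_path, index, name):
    """Seek straight to one member and decompress only that member."""
    offset, size = index[name + ".bz2"]
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return bz2.decompress(f.read(size)).decode("utf-8")
```

The index only has to be built once per archive (it could also be saved to disk), after which each lookup costs one seek plus the decompression of a single small file. The price is that bzip2 can no longer exploit redundancy across files, so the archive will likely be larger than a single tar.bz2.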

I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries, though I've only used them once or twice). However, with a single tar.bz2 you'd still have the problem of having to decompress most of the data before you found what you were looking for.

Jack Lloyd