Please recommend a technology suitable for the following task.

I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.

What I am looking for is to compress it with a good compression algorithm (like, say, gzip) with markers that would enable very occasional random access. Random access as in "read a byte from location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflate libraries like zlib, which only let you decompress the stream sequentially. What I would like is random access with a latency of, say, as much as 1 MB of decompression work per byte read.

Of course, I hope to use an existing library rather than reinvent the NIH wheel.

A: 

Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on block size.

Noah Roberts
While the raw data is processed in blocks, the compressed data blocks aren't the same size so you can't jump around in the compressed data to find a specific block.
Marcus Adams
If the file is divided into blocks that each represent 65,536 bytes of source data, one can spend four bytes per block on a table saying where each one starts. (One could just as well use a block size of 1,048,576 bytes, but even with 64K blocks a half-gig file would only need a 32 KB table.)
supercat
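For illustration, a minimal sketch of the table supercat describes (the struct name is made up, and the 64K block size and 4-byte entries are just the numbers from the comment, not any particular library's format):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical block index: entry i records where the compressed block that
// covers uncompressed bytes [i*64K, (i+1)*64K) starts in the compressed file.
struct BlockIndex {
    static constexpr std::uint64_t kBlockSize = 65536;  // 64 KB of source data per block
    std::vector<std::uint32_t> starts;                  // 4 bytes per entry, ~32 KB for 500 MB

    // Map a 64-bit address in the original data to
    // (compressed offset of its block, offset within that block).
    std::pair<std::uint32_t, std::uint64_t> locate(std::uint64_t address) const {
        return { starts.at(address / kBlockSize), address % kBlockSize };
    }
};
```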
A: 

I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.

The Boost library has been accepted by the C++ standards committee as part of TR2, so it will eventually be built into most compilers (under std::tr2::sys). It is also cross-platform.

Boost Releases

Boost Getting Started Guide. NOTE: Only some parts of boost::iostreams are header-only and need no separately-compiled library binaries or special treatment when linking.
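For the compression piece specifically, a minimal sketch of reading a gzip stream through Boost.Iostreams might look like the following (the file name and offset are placeholders; gzip itself is not seekable, so this only shows the sequential decompression plumbing, and one of the block-index schemes from the other answers would still be needed to bound the work per read). This is also one of the parts of the library that needs the compiled binary, so link against boost_iostreams and zlib.

```cpp
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

int main() {
    // Placeholder file name: a gzip-compressed copy of the matrix data.
    std::ifstream file("matrix.gz", std::ios::binary);

    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());  // decompress on the fly
    in.push(file);                                    // underlying compressed file

    // Skip sequentially to an offset in the *uncompressed* stream, then read one byte.
    const std::streamsize offset = 123456;            // placeholder address
    in.ignore(offset);
    char byte;
    if (in.get(byte))
        std::cout << static_cast<int>(static_cast<unsigned char>(byte)) << '\n';
}
```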

Elpezmuerto
`boost::iostreams` is NOT a header-only library, though parts of it are.
Billy ONeal
@Billy: updated accordingly. Thanks for the catch.
Elpezmuerto
+1  A: 

Byte Pair Encoding allows random access to data.

You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, so you can access it.

However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.

One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.

If it's accessed rarely, you could compress the index with gzip and decode it when needed.
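As a rough sketch of such an index (the entry layout and names are assumptions, not part of any BPE library): one entry per 1 MB of uncompressed data, binary-searched to find where to start decoding.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical index entry: one per 1 MB of uncompressed data.
struct IndexEntry {
    std::uint64_t uncompressed_offset;  // multiple of 1 MB
    std::uint64_t compressed_offset;    // where that region starts in the compressed stream
};

// Find the entry covering `address`; decoding starts at entry.compressed_offset and
// skips (address - entry.uncompressed_offset) bytes of output to reach the target byte.
IndexEntry find_entry(const std::vector<IndexEntry>& index, std::uint64_t address) {
    auto it = std::upper_bound(index.begin(), index.end(), address,
        [](std::uint64_t a, const IndexEntry& e) { return a < e.uncompressed_offset; });
    return *std::prev(it);  // index is sorted and its first entry has uncompressed_offset == 0
}
```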

Marcus Adams
Very nice article
Eugen Constantin Dinca
Can you recommend an existing implementation?
Pavel Radzivilovsky
+1  A: 

If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
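The front-end arithmetic is tiny; a sketch in C++ (the chunk naming scheme here is made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

// Map a 64-bit offset in the original data to the archive member holding it,
// assuming the data was split into 1 MB pieces named chunk_000000, chunk_000001, ...
// before being added to the archive.
std::pair<std::string, std::uint64_t> locate(std::uint64_t offset) {
    constexpr std::uint64_t kChunk = 1024 * 1024;
    char name[32];
    std::snprintf(name, sizeof(name), "chunk_%06llu",
                  static_cast<unsigned long long>(offset / kChunk));
    return { std::string(name), offset % kChunk };
}
```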

Edit: Yes, there is existing code to handle this. Recent versions of Info-ZIP's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file and the name of one of the files in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from zip 3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).

Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).

Jerry Coffin
That's a good idea. Unfortunately, I don't know of any existing libraries that do this, but it would be easy to implement with a wrapper class.
Marcus Adams
Editing the data might be a little trickier, but it's doable.
Marcus Adams
A: 
  1. Sort the big file first.
  2. Divide it into chunks of your desired size (1 MB), with some sequence in the name (File_01, File_02, ..., File_NN).
  3. Take the first ID from each chunk plus the filename and put both into another file.
  4. Compress the chunks.
  5. You will be able to search the ID file using whatever method you wish, perhaps a binary search, and open each file as you need (a lookup sketch follows below).

If you need deeper indexing, you could use a B-tree algorithm with the files as the "pages". Several implementations of this exist on the web, because the code is a little tricky.
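A possible lookup over that ID file, assuming it has been loaded into a sorted map (the names and layout here are only illustrative):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical in-memory copy of the ID file: first ID of each chunk -> chunk filename.
using ChunkIndex = std::map<std::uint64_t, std::string>;

// Return the chunk whose ID range contains `id`; the map's ordering gives the binary search.
// Assumes `id` is not smaller than the first chunk's first ID.
std::string chunk_for(const ChunkIndex& index, std::uint64_t id) {
    auto it = index.upper_bound(id);   // first chunk that starts *after* id
    return std::prev(it)->second;      // so the one before it contains id
}
```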

Gustavo V
I don't think we'd need multiple compressed files if we had the index.
Marcus Adams
Gustavo, thanks, but I know how to design it myself. I need someone to point me to an existing library... I cannot be the first one confronting this...
Pavel Radzivilovsky