views:

599

answers:

7

What is the best compression algorithm that allows random reads/writes in a file?

I know that any adaptive compression algorithms would be out of the question.

And I know huffman encoding would be out of the question.

Does anyone have a better compression algorithm that would allow random reads/writes?

I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?

In the context of your answers please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle. .

+2  A: 

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.

e.g.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.

Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.

This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

Ferruccio
I had posted an answer about Huffman coding. Reading your response made me pause and think about how Huffman coding is done, and you're right, random writes would spoil the encoding.
Bill the Lizard
In the case of writes you will never need extra padding. You just will have to re-compress both blocks who share the boundary that is crossed. This is because there is no API that will insert data into a position of a file.
Brian R. Bondy
@Brian R. Bondy: Surely writes are worse than that because they can change the size of the compressed file (even if the uncompressed data remains the same size).
Hugh Allen
Ah right, thanks Hugh.
Brian R. Bondy
@Brian - I was thinking you could remap the block to a different position in the file.
Ferruccio
+1  A: 

No compression scheme will allow fine-grained random access, for two related reasons:

  • you can't know exactly how far into the compressed file your desired piece of data is, therefore
  • there is no way to know where a symbol starts (at what bit position for Huffman, worse for arithmetic coding).

I can only suggest treating the file like a broadcast stream and inserting frequent synchronization / position markers, with obvious overhead (the sync marks not only take up space themselves, but complicate the encoding because it has to avoid "accidental" sync marks!). Alternatively, and to avoid seeking being something like a binary search (with the optimization that you can take a better guess where to start than the middle), you could include a "table of contents" at the start or end of the file.

As for random-access writing... I can't think of any neat solution :(

Hugh Allen
+1  A: 

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.

However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.

As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.

However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.

+2  A: 

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.

If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.

Stephen Denne
A: 

In light of your responses (from way way back ... :-)) I wonder what you think of STORWIZE solution ?

Itzhack
+2  A: 

I think Stephen Denne might be onto something here. Imagine:

  • zip-like compression of sequences to codes
  • a dictionary mapping code -> sequence
  • file will be like a filesystem
    • each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
    • "filesystem" keeps track of which "file" belongs to which bytes (start, end)
    • each "file" is compressed according to dictionary
    • reads work filewise, uncompressing and retrieving bytes according to "filesystem"
    • writes make "files" invalid, new "files" are appended to replace the invalidated ones
  • this system will need:
    • defragmentation mechanism of filesystem
    • compacting dictionary from time to time (removing unused codes)
  • done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually

One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.

This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.

Daren Thomas
+2  A: 

I am stunned at the number of responses that imply that such a thing is impossible.

Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?

I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.

Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details. But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.

Also check out: "StackOverflow: Compression formats with good support for random access within archives?"

David Cary