I have an application that stores many strings in a TStringList. The strings will be largely similar to one another, and it occurs to me that one could compress them on the fly - i.e. store a given string as a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and file names should compress particularly well.

Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that TStringList.SaveToFile produces a compressed file?

While you could implement this by decompressing the entire string list before each access and re-compressing it afterwards, that would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.

TIA Ross

A: 

You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.

EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
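
For what it's worth, the JudySL C API is small enough that a thin Delphi binding is feasible. Below is a rough sketch of such a binding; the function names follow the JudySL(3) man page, but 'judy.dll' is an assumed library name (you would have to build it yourself from the sourceforge sources) and the signatures should be verified against the headers you actually compile:

    type
      PPointer = ^Pointer;

    // JudySLIns returns a pointer to the value slot for Key, creating the
    // slot if necessary; JudySLGet returns nil when Key is absent.
    function JudySLIns(var JArray: Pointer; Key: PAnsiChar;
      PJError: Pointer): PPointer; cdecl; external 'judy.dll';
    function JudySLGet(JArray: Pointer; Key: PAnsiChar;
      PJError: Pointer): PPointer; cdecl; external 'judy.dll';
    function JudySLDel(var JArray: Pointer; Key: PAnsiChar;
      PJError: Pointer): Integer; cdecl; external 'judy.dll';

    // Usage sketch - store a pointer-sized payload under a string key:
    var
      JArray: Pointer;
      Slot: PPointer;
    begin
      JArray := nil;  // an empty Judy array is just a nil pointer
      Slot := JudySLIns(JArray, PAnsiChar(AnsiString('C:\some\path')), nil);
      Slot^ := Pointer(42);
    end;

Keys are traversed in lexicographical order (via JudySLFirst/JudySLNext), which matches the sorted-storage constraint described above.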

Marcelo Cantos
Isn't Judy an "associative array"? If so, it's a replacement for TDictionary<>, not for TStringList. And the only reference I found to JudySL is this: http://judy.sourceforge.net/doc/JudySL_3x.htm - that's not a replacement for TStringList.
Cosmin Prund
A: 

I suppose you expect general flexibility from the list (including delete operations). In that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:

  • Split each string into words and keep a separate, growing dictionary of fragments; internally, store each string as a list of indexes into that dictionary (see the sketch after this list)

  • Use the zlib stream support available in Delphi, but operate on blocks of, say, 10-100 strings each. You would still have to decompress and recompress a complete block on each access, but the "price" you pay is much lower.
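
To make the first approach concrete, here is a minimal sketch for backslash-delimited path names; TCompressedStringList and its methods are hypothetical names I'm using for illustration, not an existing component, and it assumes Delphi 2009+ for Generics.Collections:

    unit CompressedList;

    interface

    uses
      Classes, Generics.Collections;

    type
      // Each stored string becomes an array of indexes into a shared
      // fragment table, so repeated path parts are stored only once.
      TFragmentRefs = array of Integer;

      TCompressedStringList = class
      private
        FFragments: TStringList;                // fragments, insertion order
        FLookup: TDictionary<string, Integer>;  // fragment -> stable index
        FItems: TList<TFragmentRefs>;           // one index array per string
        function FragmentIndex(const Fragment: string): Integer;
      public
        constructor Create;
        destructor Destroy; override;
        procedure Add(const S: string);
        function Get(Index: Integer): string;   // random-access read
      end;

    implementation

    constructor TCompressedStringList.Create;
    begin
      inherited Create;
      FFragments := TStringList.Create;
      FLookup := TDictionary<string, Integer>.Create;
      FItems := TList<TFragmentRefs>.Create;
    end;

    destructor TCompressedStringList.Destroy;
    begin
      FItems.Free;
      FLookup.Free;
      FFragments.Free;
      inherited;
    end;

    function TCompressedStringList.FragmentIndex(const Fragment: string): Integer;
    begin
      // Reuse an existing fragment index, or append a new fragment; the
      // dictionary keeps lookups O(1) while FFragments keeps indexes stable.
      if not FLookup.TryGetValue(Fragment, Result) then
      begin
        Result := FFragments.Add(Fragment);
        FLookup.Add(Fragment, Result);
      end;
    end;

    procedure TCompressedStringList.Add(const S: string);
    var
      Parts: TStringList;
      Refs: TFragmentRefs;
      I: Integer;
    begin
      Parts := TStringList.Create;
      try
        Parts.StrictDelimiter := True;
        Parts.Delimiter := '\';
        Parts.DelimitedText := S;        // split on the path delimiter
        SetLength(Refs, Parts.Count);
        for I := 0 to Parts.Count - 1 do
          Refs[I] := FragmentIndex(Parts[I]);
        FItems.Add(Refs);
      finally
        Parts.Free;
      end;
    end;

    function TCompressedStringList.Get(Index: Integer): string;
    var
      Refs: TFragmentRefs;
      I: Integer;
    begin
      // Rebuild the original string from its fragment references.
      Refs := FItems[Index];
      Result := '';
      for I := 0 to High(Refs) do
      begin
        if I > 0 then
          Result := Result + '\';
        Result := Result + FFragments[Refs[I]];
      end;
    end;

    end.

For the second approach, Delphi's ZLib unit (TCompressionStream/TDecompressionStream) gives you the per-block compression; the trade-off is block size - larger blocks compress better but make each random access more expensive.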

Maksee
I don't actually need that much flexibility. If I were retrieving a directory listing of (say) 10,000,000 path and file names and wanted to store them in compressed form, the first approach you mention would work. I was just wondering if there was an out-of-the-box solution anywhere.
+2  A: 

I don't think there's a freely available implementation of this (none that I know of, anyway, although I've written at least three similar constructs in commercial code), so you'd have to roll your own.

The remark Marcelo made about adding items in order is very relevant, as you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added gives much better performance than having to look up a 'best fit' entry (needed for similarity compression) over the entire set.

Another thing you might want to read up on is 'ropes' - a data structure conceptually different from strings, which I suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word), you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope' representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
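
To illustrate, here is a minimal rope sketch (all names here are mine, and real ropes are usually balanced trees with cached lengths; Delphi interfaces are used so that shared nodes are reference-counted automatically):

    type
      // A rope is either a leaf holding actual text, or a concatenation
      // node pointing at two children. Children are shared, not copied.
      IRope = interface
        function ToText: string;  // reconstruct the full string
      end;

      TRopeLeaf = class(TInterfacedObject, IRope)
      private
        FText: string;
      public
        constructor Create(const AText: string);
        function ToText: string;
      end;

      TRopeConcat = class(TInterfacedObject, IRope)
      private
        FLeft, FRight: IRope;
      public
        constructor Create(ALeft, ARight: IRope);
        function ToText: string;
      end;

    constructor TRopeLeaf.Create(const AText: string);
    begin
      inherited Create;
      FText := AText;
    end;

    function TRopeLeaf.ToText: string;
    begin
      Result := FText;
    end;

    constructor TRopeConcat.Create(ALeft, ARight: IRope);
    begin
      inherited Create;
      FLeft := ALeft;
      FRight := ARight;
    end;

    function TRopeConcat.ToText: string;
    begin
      Result := FLeft.ToText + FRight.ToText;
    end;

    // Usage - two paths sharing a prefix store the prefix text exactly once:
    var
      Prefix, Path1, Path2: IRope;
    begin
      Prefix := TRopeLeaf.Create('C:\Program Files\');
      Path1  := TRopeConcat.Create(Prefix, TRopeLeaf.Create('App1\app1.exe'));
      Path2  := TRopeConcat.Create(Prefix, TRopeLeaf.Create('App2\app2.exe'));
    end;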

If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', a simple form of compression: each string starts with the index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me about 40% memory savings on average (this number is completely data-dependent, of course).
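
A minimal sketch of that prefix-reduction scheme (the names are mine; the mod-16 anchor spacing is the one mentioned above, and middle-of-list inserts/deletes would need extra bookkeeping that is left out):

    uses
      Generics.Collections;

    type
      TPrefixEntry = record
        PrefixLen: Integer;  // characters shared with the previous entry
        Suffix: string;      // the remaining unique tail
      end;

      TPrefixReducedList = class
      private
        FEntries: TList<TPrefixEntry>;
        FLast: string;       // uncompressed value of the last-added string
      public
        constructor Create;
        destructor Destroy; override;
        procedure Add(const S: string);
        function Get(Index: Integer): string;
      end;

    function CommonPrefixLen(const A, B: string): Integer;
    begin
      Result := 0;
      while (Result < Length(A)) and (Result < Length(B)) and
            (A[Result + 1] = B[Result + 1]) do
        Inc(Result);
    end;

    constructor TPrefixReducedList.Create;
    begin
      inherited Create;
      FEntries := TList<TPrefixEntry>.Create;
    end;

    destructor TPrefixReducedList.Destroy;
    begin
      FEntries.Free;
      inherited;
    end;

    procedure TPrefixReducedList.Add(const S: string);
    var
      E: TPrefixEntry;
    begin
      // Every 16th entry is an anchor stored in full, so Get never walks
      // back more than 15 entries to reconstruct a value.
      if FEntries.Count mod 16 = 0 then
        E.PrefixLen := 0
      else
        E.PrefixLen := CommonPrefixLen(FLast, S);
      E.Suffix := Copy(S, E.PrefixLen + 1, MaxInt);
      FEntries.Add(E);
      FLast := S;
    end;

    function TPrefixReducedList.Get(Index: Integer): string;
    var
      Anchor, I: Integer;
    begin
      // Start at the nearest anchor below Index, then re-apply each
      // entry's prefix + suffix until the requested entry is reached.
      Anchor := Index - (Index mod 16);
      Result := FEntries[Anchor].Suffix;  // anchors have PrefixLen = 0
      for I := Anchor + 1 to Index do
        Result := Copy(Result, 1, FEntries[I].PrefixLen) + FEntries[I].Suffix;
    end;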

PatrickvL