tags:

views:

69

answers:

1

We have a a large set of objects that include composition and name properties, both string values that contain values with a lot of duplication, what would be a suitable data structure to store the strings which can be searchable and small?

The data includes many chemical and product names that are duplicates or differ only slightly. I'd like to be able to store the string content of the objects in a compressed format that can also be searched.

I've experimented with Tries to make a fast searchable index over the names but this is currently in addition to the storage of each objects string data.

This data is static and distributed as a separate binary file with the application.

+1  A: 

I've previously had some success with a mix of LZW compression with a large table, and then interning to 32 bit identifiers. For a similar enough corpus, the LZW can compress into fewer than 32 bits, so there's a flag on the id so it is treated as a compressed bit pattern rather than a key in a hashtable. As LZW is prefix based, you can search it in a somewhat similar fashion to a trie, but it's a bit trickier; it's easier to just do a test based on whether a bit pattern contains any of the query characters when expanded, and if so expand the string and use conventional string comparison.

Pete Kirkham
Cheers Pete, could you expand a little on what you mean here, particularly the searching; how would I include fast searching like the trie?
Dog Ears
If you re-compress the strings after the LZW table has been finalised, then the compressed form will act in a somewhat similar manner to a trie - all strings which start in 'abc' when compressed will either start with the compressed code for 'abc' or a code derived from that. The LZW code doesn't care what stems map to what values, so you can arrange the stems in lexical order, so all such strings are within a range. It's not as efficient as a trie, as it requires the strings are sorted, but it doesn't require the strings to be decompressed to do a stemming search.
Pete Kirkham