I bet somebody has solved this before, but my searches have come up empty.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

Example: doll dollhouse house

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.

What I've come up with so far is:

  1. Sort the words longest to shortest: (dollhouse, house, doll)
  2. Scan the buffer to see if the string already exists as a substring; if so, note the location.
  3. If it doesn't already exist, add it to the end of the buffer.
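
A minimal Python sketch of these three steps, in case it helps to see them concretely. The function name `pack_naive` and the `(start, length)` index layout are just illustrative choices, not part of any library.

```python
def pack_naive(words):
    """Pack words into one buffer, reusing existing substrings where possible."""
    buffer = ""
    index = {}  # word -> (start, length)
    # Step 1: longest first, so shorter words can be found inside longer ones.
    for word in sorted(words, key=len, reverse=True):
        pos = buffer.find(word)  # Step 2: is it already in the buffer?
        if pos == -1:            # Step 3: if not, append it to the end.
            pos = len(buffer)
            buffer += word
        index[word] = (pos, len(word))
    return buffer, index


buffer, index = pack_naive(["doll", "dollhouse", "house"])
print(buffer)          # dollhouse
print(index["house"])  # (4, 5)
```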

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

+1  A: 

I think you can use a radix tree. It costs some memory because of the pointers to leaves and parents, but it makes it easy to match up strings (O(k), where k is the length of the longest string).

Qubeuc
I believe that only works with strings that start with common substrings. Strings that end with common substrings will not be recognized. Correct me if I'm wrong.
Zifre
If strings end with a common substring, they wouldn't be matched up anyways based on this description. Doing so would cause the individual strings to become messed up.
Daniel Lew
To elaborate, if you had "woman" and "lawman", you can't combine them even if you wanted to. The only way combination works (as I understand the problem) is if a suffix of one word matches a prefix of another.
Daniel Lew
+1  A: 

My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words, taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.

Konrad Rudolph
What you are suggesting sounds like it could be implemented with a double radix tree (one forward and one backward). This would work in most cases, but if the strings have common parts in the middle, but not on the edges, it won't work.
Zifre
For an example, it wouldn't recognize consuming and sum.
Zifre
+1  A: 

Looks similar to the Knapsack problem, which is NP-complete, so no efficient exact algorithm is known.

friol
Could you just explain to us the link with the Knapsack Problem?
akappa
The Knapsack problem (optimally packing some goods in a bag) looked similar to me. In fact (see j_random_hacker's answer) this is an NP-complete problem, like the Knapsack one.
friol
Yes, but I still can't see the similarity of that problem with the KP. 3-SAT is NPC, but I certainly can't say that it is similar to that "string packing" problem.
akappa
The "bag" is the string with the shortest length (the "optimally packed" one). Packing the goods into the bag is similar to adjusting the substrings in the "main" one: in both cases you have constraints (substring constraint or total weight limitation).
friol
IMHO the substring constraint makes the nature of the problem dramatically different, but never mind ;)
akappa
+1  A: 

I did a lab back in college where we were tasked with implementing a simple compression program.

What we did was sequentially apply these techniques to text:

  • BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
  • MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
  • Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols

Here, I found the assignment page.

To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.

Cory Larson
Interesting, but pretty much irrelevant to the question at hand. Also, it's usual to put a Run Length Encoding step in before the MTF. :)
Nick Johnson
+10  A: 

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
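
To make the two steps concrete, here is a rough Python sketch. It uses brute-force substring and overlap checks rather than the suffix tree or radix trees mentioned above, so it is only meant to illustrate the idea on small inputs; the function names `overlap` and `greedy_superstring` are made up for this example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    # First step: drop duplicates and words fully contained in another word.
    unique = list(dict.fromkeys(words))
    parts = [w for w in unique
             if not any(w != other and w in other for other in unique)]
    # Second step: repeatedly merge the pair with the longest overlap.
    while len(parts) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(parts):
            for j, b in enumerate(parts):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = parts[best_i] + parts[best_j][best_k:]
        parts = [p for idx, p in enumerate(parts)
                 if idx not in (best_i, best_j)]
        parts.append(merged)
    return parts[0] if parts else ""


words = ["doll", "dollhouse", "house", "ragdoll"]
packed = greedy_superstring(words)
# Positions of all words, including the contained ones, are looked up afterwards.
index = {w: (packed.find(w), len(w)) for w in words}
print(packed)  # ragdollhouse
```

Each pass of the loop compares every pair of remaining strings, so this naive version would be far too slow for tens of thousands of words; that is where the tree-based overlap lookup would come in.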

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

j_random_hacker
Thanks! Having a name for the problem is always a great start. I figured a perfect solution might be out of reach, but a good solution would be satisfying.
Adrian McCarthy
A: 

I would not reinvent this wheel yet again. An enormous amount of effort has already gone into compression algorithms, so why not use one of the existing ones?

Here are a few good choices:

  • gzip for fast compression / decompression speed
  • bzip2 for a bit better compression but much slower decompression
  • LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
  • lzop for very fast compression / decompression

If you use Java, gzip is already integrated.

martinus
I'm after packing, not compression. At run-time, I want the full text of each word readily accessible. I could do that without any sort of packing, but I recognized that packing could give me a significant reduction in footprint and improved locality of reference.
Adrian McCarthy
martinus
With compression, you have to decompress. With packing as I've described, there's no unpacking required. I have the full text of the original words directly available.
Adrian McCarthy
A: 

It's not clear what you want to do.

Do you want a data structure that stores the strings in a memory-conscious manner while still allowing operations like search in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a patricia trie or a String B-Tree.

For the second case, you can just adopt some index compression technique, like this:

If you have something like:

aaa 
aaab
aasd
abaco
abad

You can compress it like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix with the preceding string. You can tweak that scheme, e.g. by planning a "restart" of the common prefix every K words, for fast reconstruction.
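
As a rough illustration of the prefix scheme above (without the "restart" tweak), here is a small Python sketch. The names `front_encode` and `front_decode` are only illustrative, and the input is assumed to be sorted already.

```python
def front_encode(sorted_words):
    """Encode each word as (shared prefix length, remaining suffix)."""
    encoded = []
    prev = ""
    for word in sorted_words:
        n = 0
        # Count leading characters shared with the preceding word.
        while n < min(len(prev), len(word)) and prev[n] == word[n]:
            n += 1
        encoded.append((n, word[n:]))
        prev = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list by replaying the shared prefixes in order."""
    words = []
    prev = ""
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        words.append(prev)
    return words


words = ["aaa", "aaab", "aasd", "abaco", "abad"]
encoded = front_encode(words)
print(encoded)
print(front_decode(encoded) == words)  # True
```

Note that decoding has to walk forward from the start (or from the last restart point), which is the price paid for the extra compression.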

akappa
Note that, with the last scheme, you should get much better compression than with a packing like the one you've suggested. Of course, you can't just keep one pointer to the word; you need a tuple (pointer to the first word with 0 prefix, offset).
akappa
I'm not looking for a compression method. I need fast random-access to the full text of each word, so I don't want to decompress on the fly. Packing reduces the memory footprint and improves locality of reference.
Adrian McCarthy
Are you sure that it improves locality? Locality depends largely on the order in which you request words, not only on the memory footprint (except in edge cases, of course). And are you really sure that it greatly reduces the memory footprint? It seems to me that this optimization can be a good thing if you have a particular set of strings, but it's practically useless on, e.g., natural language words.
akappa
A: 

Refine step 3.

  • Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
  • If yes, then prepend the non-overlapping prefix of the current word to the existing word, and adjust all existing references appropriately (slow!)
  • If no, add word to end of list as in current step 3.

This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
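
A rough Python sketch of this refinement, under one simplifying assumption that is mine rather than part of the steps above: it only checks whether the start of the whole buffer matches a suffix of the current word, since inserting in the middle of the buffer could break words already stored there. The name `pack_with_prepend` and the `min_overlap` parameter are illustrative.

```python
def pack_with_prepend(words, min_overlap=2):
    buffer = ""
    index = {}  # word -> start offset in buffer
    for word in sorted(words, key=len, reverse=True):
        pos = buffer.find(word)
        if pos == -1:
            # Longest suffix of `word` (at least min_overlap long) that the
            # buffer starts with. Assumption: only the buffer start is checked.
            k = 0
            for size in range(min(len(word), len(buffer)),
                              max(min_overlap, 1) - 1, -1):
                if buffer.startswith(word[-size:]):
                    k = size
                    break
            if k:
                added = len(word) - k
                buffer = word[:added] + buffer
                # Everything stored so far shifts right (the slow part).
                index = {w: p + added for w, p in index.items()}
                pos = 0
            else:
                pos = len(buffer)
                buffer += word
        index[word] = pos
    return buffer, index


buffer, index = pack_with_prepend(["doll", "dollhouse", "house", "ragdoll"])
print(buffer)  # ragdollhouse
```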

Jonathan Leffler