views:

381

answers:

3

Problem:

Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.

Example:

"foo", "fool", "bar"

The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.

A naive algorithm would look like:

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

That will give us the answer, but it's something like O((n*m)^2), which is too expensive.

+1  A: 

I would try starting by sorting the list. Then you simply go from string to string comparing the first character to the next string's first char. Once you have a match you would look at the next char. You would need to devise a way to track the best result so far.

EBGreen
With that approach, can you guarantee that you will have an optimal solution? If you always pick the char which gives you most strings with the same prefix, you end up with the longest common prefix, and that might not be what gives the best compression.
Markus Johansson
That would rely on the part about "You would need to devise a way to track the best result so far."
EBGreen
+6  A: 

Use a forest of prefix trees (trie)...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

then, we can find the best result, and guarantee it, by maximizing (depth * frequency) which will be replaced with your escape character. You can optimize the search by doing a branch and bound depth first search for the maximum.

On the complexity: O(C), as mentioned in comment, for building it, and for finding the optimal, it depends. If you order the first elements frequency (O(A) --where A is the size of the languages alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.

I think this is clear, I am not going to write it up --what is this a homework assignment? ;)

nlucaroni
Sounds good, though I think you'd want ((depth - 1) * frequency), assuming the size of the replacement is equal to that of one character (though question says one byte). Should run in O(c) where c is the total number of characters.
Dave L.
The first part is basically building a trie from a list of strings, by the way.
Tyler
Haha, no it's not a homework assignment. I'm far too old for that. =) I actually have a fairly good, working implementation, but it's not guaranteed to give an optimal result. Nice idea with a tree.
Markus Johansson
+1  A: 

Well, first step would be to sort the list. Then one pass through the list, comparing each element with the previous, keeping track of the longest 2-character, 3-character, 4-character etc runs. Then figure is the 20 3-character prefixes better than the 15 4-character prefixes.

James Curran