ansaurus

Question

Find the prefix substring which gives best compression

Answer 1

+1 A:

I would start by sorting the list. Then go from string to string, comparing the first character of each string with the first character of the next one. Once you have a match, look at the next character. You would also need to devise a way to track the best result so far.

EBGreen 2008-09-29 21:15:36

With that approach, can you guarantee that you will have an optimal solution? If you always pick the char which gives you the most strings with the same prefix, you end up with the longest common prefix, and that might not be what gives the best compression.

Markus Johansson 2008-09-29 21:24:00

That would rely on the part about "You would need to devise a way to track the best result so far."

EBGreen 2008-09-29 23:42:23

Answer 2

+6 A:

Use a forest of prefix trees (tries), where each node's subscript is the number of strings that pass through it:

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

Then we can find the best result, and guarantee it, by maximizing (depth * frequency) over the nodes; the prefix at the winning node is what gets replaced with your escape character. You can speed up the search by doing a branch-and-bound depth-first search for the maximum.
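A minimal sketch of this idea in Python, not a definitive implementation. It assumes the diagram was built from the strings "foo", "fool" and "bar", uses a single dummy root in place of a forest, and scores nodes with the (depth * frequency) formula as stated. The names `Node`, `build_trie`, `best_prefix` and the `height` field used for the bound are all made up for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0    # number of strings passing through this node
        self.height = 0   # length of the longest path below this node


def build_trie(strings):
    root = Node()
    for s in strings:
        node = root
        node.count += 1
        node.height = max(node.height, len(s))
        for i, ch in enumerate(s, start=1):
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            node.count += 1
            node.height = max(node.height, len(s) - i)
    return root


def best_prefix(strings):
    """Return (prefix, score) maximizing depth * frequency."""
    root = build_trie(strings)
    best_prefix_found, best_score = "", 0

    def visit(node, prefix):
        nonlocal best_prefix_found, best_score
        score = len(prefix) * node.count
        if score > best_score:
            best_prefix_found, best_score = prefix, score
        # Most frequent children first, so strong candidates are found early.
        ordered = sorted(node.children.items(), key=lambda kv: -kv[1].count)
        for ch, child in ordered:
            # Upper bound for the subtree: frequency can only drop below
            # `child`, and depth cannot exceed the deepest leaf under it.
            bound = child.count * (len(prefix) + 1 + child.height)
            if bound > best_score:
                visit(child, prefix + ch)

    visit(root, "")
    return best_prefix_found, best_score


print(best_prefix(["foo", "fool", "bar"]))
```

For the three strings above this picks "foo" with a score of 6, and the "l" and "b" subtrees are skipped by the bound. The recursion depth equals the longest string, so very long inputs would need an explicit stack.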

On the complexity: building it is O(C), as mentioned in the comments, where C is the total number of characters. The cost of finding the optimum depends on the input. If you order the first elements by frequency (O(A), where A is the size of the language's alphabet), then you'll be able to cut out more branches and have a good chance of getting sub-linear time.

I think this is clear, so I am not going to write it up. What is this, a homework assignment? ;)

nlucaroni 2008-09-29 21:19:26

Sounds good, though I think you'd want ((depth - 1) * frequency), assuming the size of the replacement is equal to that of one character (though the question says one byte). It should run in O(c), where c is the total number of characters.
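To make the difference between the two scores concrete, here is a brute-force sketch in Python. It counts every prefix directly instead of building a trie, so it is only meant for small inputs. The word list and the names `prefix_counts`, `best_by` and `escape_len` are made up for illustration.

```python
from collections import Counter


def prefix_counts(strings):
    """Count how many strings start with each non-empty prefix."""
    counts = Counter()
    for s in strings:
        for k in range(1, len(s) + 1):
            counts[s[:k]] += 1
    return counts


def best_by(strings, escape_len):
    """Prefix maximizing (len(prefix) - escape_len) * frequency."""
    counts = prefix_counts(strings)
    return max(counts, key=lambda p: (len(p) - escape_len) * counts[p])


words = ["a1", "a2", "a3", "a4", "a5", "bcd"]
print(best_by(words, 0))  # depth * frequency
print(best_by(words, 1))  # (depth - 1) * frequency
```

With these words, depth * frequency picks "a" (1 * 5 = 5), even though replacing one character with a one-character escape saves nothing. (depth - 1) * frequency picks "bcd" (2 * 1 = 2) instead.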

Dave L. 2008-09-29 23:12:38

The first part is basically building a trie from a list of strings, by the way.

Tyler 2008-09-29 23:35:23

Haha, no it's not a homework assignment. I'm far too old for that. =) I actually have a fairly good, working implementation, but it's not guaranteed to give an optimal result. Nice idea with a tree.

Markus Johansson 2008-09-30 05:58:27

Answer 3

+1 A:

Well, the first step would be to sort the list. Then make one pass through the list, comparing each element with the previous one and keeping track of the longest 2-character, 3-character, 4-character, etc. runs. Then figure out whether, say, the 20 3-character prefixes are better than the 15 4-character prefixes.
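The pass described above could be sketched in Python roughly as follows. This is one possible reading of the approach, not a tested production implementation. It assumes the saving for a prefix is (length - `escape_len`) * run length, with a one-character escape by default. The function names and the `escape_len` parameter are made up for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_runs(strings):
    """Map prefix length -> (run length, prefix) for the longest run."""
    best = {}
    run = []    # run[k] counts consecutive strings sharing prev[:k + 1]
    prev = ""
    # The empty sentinel shares nothing with its predecessor,
    # so it closes every run that is still open at the end.
    for word in sorted(strings) + [""]:
        shared = common_prefix_len(prev, word)
        # Runs for prefixes longer than `shared` end here; record them.
        for k in range(shared, len(run)):
            if run[k] > best.get(k + 1, (0, ""))[0]:
                best[k + 1] = (run[k], prev[:k + 1])
        del run[shared:]
        for k in range(shared):
            run[k] += 1
        # Every longer prefix of `word` starts a new run of length 1.
        run.extend([1] * (len(word) - shared))
        prev = word
    return best


def pick_prefix(strings, escape_len=1):
    """Compare the best run of each length and return (prefix, saving)."""
    runs = longest_runs(strings)
    if not runs:
        return "", 0
    length, (count, prefix) = max(
        runs.items(), key=lambda kv: (kv[0] - escape_len) * kv[1][0]
    )
    return prefix, (length - escape_len) * count


print(pick_prefix(["foo", "fool", "bar"]))
```

Because sorting puts strings with a common prefix next to each other, comparing each string only with its predecessor is enough to find every run. For "foo", "fool" and "bar" this returns "foo" with a saving of 4.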

James Curran 2008-09-29 21:19:28
