ansaurus

Question

Answer 1

+1 A:

I think you can use a Radix Tree. It costs some memory because of the pointers to leaves and parents, but it makes matching strings easy: O(k), where k is the length of the longest string.
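For illustration, here is a minimal sketch of a radix tree in Python. The names RadixNode, insert, and contains are hypothetical, and this version stores only child links (no parent pointers). It shows how words sharing a prefix share edges; it is not a tuned implementation.

```python
import os


class RadixNode:
    def __init__(self):
        self.children = {}    # edge label (str) -> RadixNode
        self.is_word = False  # True if a stored word ends at this node


def insert(root, word):
    """Insert word, splitting an edge when only part of its label matches."""
    node = root
    while True:
        for label, child in list(node.children.items()):
            common = os.path.commonprefix([label, word])
            if not common:
                continue
            if common != label:
                # Split the edge: keep the shared part, push the rest down.
                mid = RadixNode()
                mid.children[label[len(common):]] = child
                del node.children[label]
                node.children[common] = mid
                child = mid
            node, word = child, word[len(common):]
            break
        else:
            # No edge shares a prefix with what is left of the word.
            if word:
                leaf = RadixNode()
                leaf.is_word = True
                node.children[word] = leaf
            else:
                node.is_word = True
            return


def contains(root, word):
    """Return True if word was inserted; follows one edge per step."""
    node = root
    while word:
        for label, child in node.children.items():
            if word.startswith(label):
                node, word = child, word[len(label):]
                break
        else:
            return False
    return node.is_word
```

Usage: create root = RadixNode(), call insert(root, w) for each word, then contains(root, w) to look one up.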

Qubeuc 2009-05-10 13:28:00

I believe that only works with strings that start with common substrings. Strings that end with common substrings will not be recognized. Correct me if I'm wrong.

Zifre 2009-05-10 13:31:49

If strings end with a common substring, they wouldn't be matched up anyways based on this description. Doing so would cause the individual strings to become messed up.

Daniel Lew 2009-05-10 13:41:13

To elaborate, if you had "woman" and "lawman", you can't combine them even if you wanted to. The only way combination works (as I understand the problem) is if a suffix of one word matches a prefix of another.

Daniel Lew 2009-05-10 13:43:34

Answer 2

+1 A:

My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then order the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.

Konrad Rudolph 2009-05-10 13:31:58

What you are suggesting sounds like it could be implemented with a double radix tree (one forward and one backward). This would work in most cases, but if the strings have common parts in the middle, but not on the edges, it won't work.

Zifre 2009-05-10 13:34:51

For an example, it wouldn't recognize consuming and sum.

Zifre 2009-05-10 15:48:23

Answer 3

+1 A:

Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.

friol 2009-05-10 13:48:07

Could you just explain to us the link with the Knapsack Problem?

akappa 2009-05-10 15:14:27

The Knapsack problem (optimally packing some goods in a bag) looked similar to me. In fact (see j_random_hacker's answer) this is an NP-complete problem, like the Knapsack one.

friol 2009-05-10 15:27:57

Yes, but I still can't see the similarity of that problem with the KP. 3-SAT is NPC, but I certainly can't say that it is similar to that "string packing" problem.

akappa 2009-05-10 15:34:06

The "bag" is the string with the shortest length (the "optimally packed" one). Packing the goods into the bag is similar to adjusting the substrings in the "main" one: in both cases you have constraints (substring constraint or total weight limitation).

friol 2009-05-10 15:42:32

IMHO the substring constraint makes the nature of the problem dramatically different, but never mind ;)

akappa 2009-05-10 15:52:03

Answer 4

+1 A:

I did a lab back in college where we were tasked with implementing a simple compression program.

What we did was sequentially apply these techniques to text:

BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations).
MTF (Move-to-front transform): rewrites the sequence of letters as a sequence of indices into a dynamic list.
Huffman encoding: a form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols.
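As a rough sketch of the first two stages (not the assignment's actual code), here is a naive BWT and MTF in Python. The function names and the "\0" end marker are assumptions made for this example, the BWT really does build and sort all rotations (so it is slow on long inputs), and the Huffman stage is left out.

```python
def bwt(text, end="\0"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep last column.

    Assumes the end marker does not occur in text.
    """
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last, end="\0"):
    """Rebuild the rotation table column by column, then pick the original."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


def mtf_encode(text, alphabet):
    """Replace each symbol by its index in a list that moves it to the front."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out


def mtf_decode(indices, alphabet):
    """Inverse of mtf_encode, given the same starting alphabet."""
    symbols = list(alphabet)
    out = []
    for i in indices:
        ch = symbols.pop(i)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)
```

Runs of identical letters coming out of the BWT turn into runs of zeros after MTF, which is what makes the later entropy coding stage effective.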

Here, I found the assignment page.

To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.

Cory Larson 2009-05-10 14:05:11

Interesting, but pretty much irrelevant to the question at hand. Also, it's usual to put a Run Length Encoding step in before the MTF. :)

Nick Johnson 2009-05-10 15:58:34

Answer 5

+10 A:

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
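The two steps above can be sketched in Python as follows. This is a brute-force illustration, not a tuned implementation: overlap and greedy_superstring are hypothetical names, and both the containment check and the overlap search use plain string comparisons instead of a suffix tree or radix trees.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Greedy heuristic: drop contained words, then merge by largest overlap."""
    # Step 1: remove duplicates and words that occur inside another word.
    # Longest first, so any container of w has already been considered.
    strings = []
    for w in sorted(set(words), key=len, reverse=True):
        if not any(w in s for s in strings):
            strings.append(w)

    # Step 2: repeatedly merge the ordered pair with the longest overlap.
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)

    return strings[0] if strings else ""
```

After packing, the position of each original word (including the ones dropped in step 1) can be recovered with packed.find(word).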

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

j_random_hacker 2009-05-10 14:54:06

Thanks! Having a name for the problem is always a great start. I figured a perfect solution might be out of reach, but a good solution would be satisfying.

Adrian McCarthy 2009-05-10 15:52:10

Answer 6

A:

I would not reinvent this wheel yet another time. An enormous amount of manpower has already gone into compression algorithms, so why not take one of the already available ones?

Here are a few good choices:

gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression

If you use Java, gzip is already integrated.

martinus 2009-05-10 15:10:25

I'm after packing, not compression. At run-time, I want the full text of each word readily accessible. I could do that without any sort of packing, but I recognized that packing could give me a significant reduction in footprint and improved locality of reference.

Adrian McCarthy 2009-05-10 16:42:10

martinus 2009-05-11 11:42:34

With compression, you have to decompress. With packing as I've described, there's no unpacking required. I have the full text of the original words directly available.

Adrian McCarthy 2009-05-11 17:34:15

Answer 7

A:

It's not clear what you want to do.

Do you want a data structure that lets you store the strings in a memory-conscious manner while keeping operations like search possible in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a patricia trie or a String B-Tree.

For the second case, you can just adopt some index compression technique, like this:

If you have something like:

aaa 
aaab
aasd
abaco
abad

You can compress it like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix shared with the preceding string. You can tweak this scheme, for example by forcing a "restart" of the common prefix (storing a full word) every K words, for fast reconstruction.
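A minimal sketch of this prefix scheme (often called front coding) in Python. The names front_encode and front_decode and the restart parameter are made up for this example, and the input is assumed to be already sorted.

```python
import os


def front_encode(sorted_words, restart=0):
    """Encode each word as (shared prefix length with previous word, rest).

    If restart is a positive number K, every K-th word is stored in full
    (prefix length 0) so decoding can start from that point.
    """
    encoded, prev = [], ""
    for i, w in enumerate(sorted_words):
        if restart and i % restart == 0:
            k = 0
        else:
            k = len(os.path.commonprefix([prev, w]))
        encoded.append((k, w[k:]))
        prev = w
    return encoded


def front_decode(encoded):
    """Rebuild the word list by reusing k characters of the previous word."""
    words, prev = [], ""
    for k, rest in encoded:
        prev = prev[:k] + rest
        words.append(prev)
    return words
```

Note that reading a single word means decoding forward from the nearest full entry, which is why a small restart interval helps random access.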

akappa 2009-05-10 15:23:01

Note that the last scheme should compress much better than a packing like the one you've suggested. Of course, you then can't have just one pointer to the word; you need a tuple (pointer to the first word with 0 prefix, offset).

akappa 2009-05-10 15:36:55

I'm not looking for a compression method. I need fast random-access to the full text of each word, so I don't want to decompress on the fly. Packing reduces the memory footprint and improves locality of reference.

Adrian McCarthy 2009-05-10 16:44:36

Are you sure that it improves locality? Locality depends largely on the order in which you request words, not only on the memory footprint (except in edge cases, of course). And are you really sure that it greatly reduces the memory footprint? It seems to me that this optimization can be a good thing if you have a particular set of strings, but it's practically useless on, for example, natural language words.

akappa 2009-05-10 18:12:58

Answer 8

A:

Refine step 3.

Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.

This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).

Jonathan Leffler 2009-05-10 15:45:40


Text packing algorithm
