ansaurus

Question

C++ - How to efficiently find out if any string in a vector can be assembled from a set of letters

Answer 1

+7 A:

Without giving you any specific code (since this is homework after all), one general approach to consider is to map from the sorted letters in the word to the actual legal words.

That is to say, if your dictionary file had only the words ape, gum, and mug, your data structure would look like:

aep -> ape
gmu -> gum, mug

Then you can simply go through permutations of the player's letters, and quickly identify if that key exists in the map.

You pay a little bit of processing time setting up the dictionary at startup, but then you only have to perform a few quick lookups rather than iterating through the whole list every time.

Mark Rushakoff 2010-05-14 16:43:42

That's exactly what I was in the process of typing up.

Kaleb Pederson 2010-05-14 16:45:54

This is how Jon Bentley describes his anagram detection/creation algorithm in "Programming Pearls".It's also wrong: it would only identify words which can be produced with *all* the player's letters.

jemfinch 2010-05-14 16:53:46

@jemfinch: Precisely.

Francisco P. 2010-05-14 16:58:26

@jemfinch: You are correct to say that this does not allow a single lookup to determine all anagrams from all subsets of the player's letters, but I did specify in my answer that you would need to perform multiple lookups.

Mark Rushakoff 2010-05-14 16:59:37

"Then you can simply go through permutations of the player's letters" - I'm sorry, I'm a beginner, how exactly would I generate these permutations?

Francisco P. 2010-05-14 17:04:22

@Mark I missed that you would do multiple lookups. The problem then is that you'd still need to do 127 lookups; not exactly efficient.

jemfinch 2010-05-14 17:12:51

Answer 2

+1 A:

You could also store the strings with characters sorted in ASCIIbetical order into a std::set, then sort the player's letters into the same order and search the map for each substring of the player's letters.

computergeek6 2010-05-14 16:45:42

Answer 3

+1 A:

How about keeping the pairs {word from the dictionary, string consisting of the same letters but in ascending order (sorted)}

Then sort the vector of those pairs based on the second string and compare using binary search with a string consisting of sorted letters from players hand.

Maciej Hehl 2010-05-14 16:46:53

Answer 4

+2 A:

Sounds like a variation of the subset sum problem: http://en.wikipedia.org/wiki/Subset_sum_problem

Maybe some of the described algorithms would help you.

Noah Roberts 2010-05-14 17:02:01

Answer 5

+2 A:

There have been numerous papers and questions on Scrabble on this site.

There are many strategies available too.

The representation of your dictionary is inadequate, there are much clever methods available. For example, check what a Trie is on wikipedia.

Using this you can implement a backtracking algorithm to quickly determine which words you can form.

{'as', 'ape', 'gum'}

Trie:

void -a-> (n) -p-> (n) -e-> (y)
              -s-> (y)
     -g-> (n) -u-> (n) -m-> (y)

Where 'n' means that it does not form a word and y means that it does.

Now, you just have to walk the Trie, keeping in mind what letters are available.

Say that you have {'a', 'p', 'g', 'm', 'u'}:

 1. I have a 'a' (but 'a' is not a word)
 2. I have a 'p' (but 'ap' is not a word)
 3. I don't have any 'e' so I can't go further, let's backtrack
 4. I don't have any 's' so...
 5. I have a 'g', but it's not a word
 6. I have a 'u', but 'gu' is not a word
 7. I have a 'm' and 'gum' is a word, I store it somewhere, I can't go further

The point is to maintain a set of the available letters, when you take the -a-> branch, you remove 'a' from this set, then when you take -a-> in reverse (while backtracking) you add it back in the set.

This structure is much more space efficient, it actually models a Finite Automaton which recognize the language of your dictionary instead of blindly saving all words
The runtime should be much faster as well, since you'll never go deep in the tree-structure (you only have 7 letters available)
It's certainly not what I'd do, since it does not take the board into account :p

' ' letters mean you can take any of the available branches. You don't need to use a blank if you have the required letter.

Matthieu M. 2010-05-14 17:02:28

Thank you for your thorough answer. At the beginning of my project, I thought of a Trie, yet I wanted to avoid implementing such a complicated data structure. I found a good implementation of a radix tree online, and got an "all-clear" from my instructor to use it. Do you think that would cut it?

Francisco P. 2010-05-14 17:10:37

The radix tree is a "space-efficient" trie, the principle is the same so it will definitely work. However the main issue with your logic: just trying to form words with the letters in your possession is insufficient ;) try search for scrabble on SO if you want more clues.

Matthieu M. 2010-05-14 17:36:16

Answer 6

+1 A:

There are some good answers here already, and I think a trie is probably the right way to go, but this is an interesting problem so I'll toss in my two cents' worth...

The naive approach would be to generate all permutations of the available letters and of all distinct subsets, then search for each potential word in the dictionary. The problem is that, while it's not hard to do this, there is a surprisingly large number of potential words, and most of them are invalid.

On the positive side, checking the dictionary can be sped up with a binary search or something similar. On the negative side, you'd be doing this so many times that the program would grind to a halt for long lists of letters.

We definitely need to preprocess the dictionary to make it more useful, and what we really need is to have a way to quickly rule out most of the potential matches, even if the method has occasional false positives.

One way to do this would be to represent which letters a word uses in a bit map. In other words, precalculate a 32-bit number for each word in the dictionary, where each bit is set if the corresponding letter of the alphabet is used in the word at least once. This would allow you to find all potential words by doing a linear scan of the dictionary and keeping only the ones that use only letters you have available. I suspect that, with a bit of cleverness and indexing, you can do better than linear.

Of the candidates you find, some will require more instances of a letter than you have available, so these will be false positives. That means you need to do a final check on all of the candidates you generated to eliminate the almost-hits. There are many ways to do this, but one of the simplest is to go through your list of letters and replace the first occurrence of that letter in the potential word with a dash. When you're done, if the potential word has anything but dashes, it's a failure. A more elegant solution, though not necessarily faster, would be to generate an array of letter frequencies and compare them.

Again, I think tries are probably the way to go, but I hope these ideas are useful to you.

edit

Let me toss out an example of how you could do better than a full linear search on the initial search: use the radix. Keep a simple index that lets you look up the first word that starts with a given letter. Then, when doing the search, skip over all words that start with a letter that you don't have. This is not a gigantic speedup, but it's an improvement.

Steven Sudit 2010-05-19 16:14:43

I'm not going to edit further, but I feel obligated to mention that Bloom filters would be a great way to check any list of potential words against the dictionary, in that they're fast and do not allow false negatives.

Steven Sudit 2010-05-25 15:39:57

ansaurus

tags:

views:

answers:

C++ - How to efficiently find out if any string in a vector can be assembled from a set of letters

related questions