Hello, everybody!

I am implementing a text-based version of Scrabble for a College project.

My dictionary is quite large, weighing in at around 400,000 words (std::string).

Searching for a valid word will suck, big time, in terms of efficiency if I go for a vector<string> ( O(n) ). Are there any good alternatives? Keep in mind, I'm enrolled in freshman year. Nothing TOO complex!

Thanks for your time!

Francisco

+22  A: 

If you wanted to go for something that is in the standard library, you could use std::set with the word as the key. That would give you logarithmic search time.

Since your dictionary is presumably static (i.e. created once and not modified), you could also use a std::vector, sort it using std::sort, and then use std::binary_search on the sorted vector to find a word. This would also give logarithmic search time, and would possibly be more space efficient than a set.
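
For example, a rough sketch of both approaches (the file name and the test word here are just placeholders, and the main() framing is only illustrative):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main() {
    std::ifstream file("dictionary.txt");   // hypothetical word list, one word per line
    std::string word;

    std::set<std::string> dict_set;         // option 1: std::set, O(log n) lookups
    std::vector<std::string> dict_vec;      // option 2: sorted vector + binary_search

    while (file >> word) {
        dict_set.insert(word);
        dict_vec.push_back(word);
    }
    std::sort(dict_vec.begin(), dict_vec.end());

    std::string query = "quixotic";         // placeholder query word
    std::cout << "set:    " << (dict_set.count(query) > 0) << '\n'
              << "vector: " << std::binary_search(dict_vec.begin(),
                                                  dict_vec.end(), query) << '\n';
    return 0;
}

Either way, you build the container once at startup and reuse it for every lookup.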

If you want to implement your own data structure, a trie would be a good choice.

James McNellis
Can you give me more details on the trie? Do you recommend I just search the web? I'm a beginner, mind you :P
Francisco P.
@Francisco: The Wikipedia article (http://en.wikipedia.org/wiki/Trie) is a decent description; a Google search would give lots of results, and probably a lot of sample code, on building and using Tries. There are quite a few questions here on StackOverflow about Tries as well.
James McNellis
Thanks a bunch.
Francisco P.
http://sourceforge.net/projects/radixtree/ - here's a great trie implementation.
the_drow
+1 Although I would change "would possibly" to "will": there's no overhead in a sorted `vector`, but it's impossible to implement `set` without some kind of overhead. Also, a sorted `vector` is more cache friendly.
Andreas Brinck
@Andreas A binary search through a sorted vector is not exactly cache friendly :)
FredOverflow
The keyword is "more" ;)
Andreas Brinck
+6  A: 

std::set is a natural for this because it takes almost zero work on top of using a vector. Even so, I will teach you something you don't usually learn until you are a professional: don't optimize prematurely. I bet that on a modern computer a linear dictionary lookup in a 400K-string vector takes 0.001 seconds.

In a set, it's O(log n) and probably takes .00001 seconds.

Anything not in the STL is a total waste of time. Don't spend $10 of work on a 10 cent problem.

frankc
When you can choose between a `std::set`, a `std::vector`, and a `std::tr1::unordered_set`, picking the fastest one for your purpose isn't premature optimization.
Ken Bloom
@Yacoby: Java's a completely different runtime, with completely different kinds of overhead (e.g. interpreter time, JIT time, any garbage collection). It's not comparable to C++, and may be off by an order of magnitude for the same number of strings, depending on how the benchmark is set up.
Ken Bloom
@Ken that is why I recommend a set in the post, and the STL in general...
frankc
@user275455 A sorted `vector` will be vastly faster than a `set`.
Andreas Brinck
+2  A: 

If the vector is sorted, you can use binary_search to test if a given word exists in the dictionary.

UncleBens
+6  A: 

A trie or radix tree would give you search and insertion times that are linear in the length of the string you are searching for.

(Note that linear in the length of the string is the best any search algorithm can do, because comparing or hashing a string is itself linear in its length; that per-string factor is therefore usually left out of the stated running times for binary search, binary trees, or linear search.)

These solutions are probably overkill if you don't have them in your library already.
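
If you do want to roll your own, a bare-bones trie over the letters a-z might look something like this (just a sketch with illustrative names; it assumes alphabetic ASCII words and never frees its nodes):

#include <cctype>
#include <string>

// Bare-bones trie: one node per letter of each stored word.
struct TrieNode {
    TrieNode* children[26];
    bool is_word;                               // true if a word ends at this node
    TrieNode() : is_word(false) {
        for (int i = 0; i < 26; ++i) children[i] = 0;
    }
};

// Insert a word; O(length of the word).  Assumes alphabetic ASCII input.
void insert(TrieNode* root, const std::string& word) {
    TrieNode* node = root;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        int c = std::tolower(static_cast<unsigned char>(word[i])) - 'a';
        if (!node->children[c]) node->children[c] = new TrieNode();
        node = node->children[c];
    }
    node->is_word = true;
}

// Look a word up; also O(length of the word).
bool contains(const TrieNode* root, const std::string& word) {
    const TrieNode* node = root;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        int c = std::tolower(static_cast<unsigned char>(word[i])) - 'a';
        if (c < 0 || c > 25 || !node->children[c]) return false;
        node = node->children[c];
    }
    return node->is_word;
}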

Ken Bloom
+1  A: 

Use a std::tr1::unordered_set, which gives you expected constant-time lookup. (Constant in the number of words, that is; still linear in the length of the string, as per my other answer.)
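
For example (a minimal sketch; the helper name is just illustrative, and it's shown with the C++11 header name, while on TR1-era compilers the type lives in std::tr1):

#include <string>
#include <unordered_set>

// Average-case O(1) membership test, plus the cost of hashing the word itself.
bool is_valid_word(const std::unordered_set<std::string>& dictionary,
                   const std::string& word) {
    return dictionary.find(word) != dictionary.end();
}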

Ken Bloom
+3  A: 

What is the purpose of your data structure? What sort of things do you want to do with it?

  • Check if a complete word like "tempest" is in the list?
  • Find all seven letter words starting and ending with "t"?
  • Find all seven letter words starting and ending with "t" that can be made with a given set of letters?

The first question is all you need answered if you're implementing some sort of referee for a group of human players, that is, some entity that checks proposed words against the official dictionary. Basic data structures like std::map, std::set, and std::vector, already suggested by others, suffice for this purpose.

The second and third questions are the kind you need answered if you're writing a player. Here you might want to consider 26 sets for each letter position, each set holding the words with a given letter at a given position. You'll need extra code to compute intersections of those sets when needed, and perhaps to check candidate words against the letters available on your rack.
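
A minimal sketch of that kind of index (the sizes and names are just illustrative, words are assumed to be lowercase a-z, and the intersection step is only outlined in a comment):

#include <cstddef>
#include <set>
#include <string>

const std::size_t MAX_WORD_LEN = 15;   // a standard Scrabble board is 15 squares wide

// index_table[pos][letter - 'a'] holds every word with that letter at that position.
std::set<std::string> index_table[MAX_WORD_LEN][26];

void add_word(const std::string& word) {
    for (std::size_t pos = 0; pos < word.size() && pos < MAX_WORD_LEN; ++pos) {
        index_table[pos][word[pos] - 'a'].insert(word);
    }
}

// "Seven-letter words starting and ending with 't'" is then the intersection of
// index_table[0]['t' - 'a'] and index_table[6]['t' - 'a'], filtered to length 7
// (std::set_intersection works here because std::set keeps its elements sorted).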

Update: In a comment on the original question, the OP made it clear that he only has to check if a word is in the dictionary. This is the first question I asked above, and any of the standard efficient data structures is fine.

Dale Hagglund
+4  A: 

I did some profiling and got the following results (MSVS 2008, /O2, release build, launching .exe separately).

EDIT: I now realize I actually f*cked up my first test because I didn't split the building and the searching. Though that didn't change the "winners", I made some new tests, so here are the results with the two phases separated.

First, the results when there are almost no bad search requests (4 million good search attempts).

[ RUN      ] Containers.DictionaryPrepare
[       OK ] Containers.DictionaryPrepare (234 ms)
[ RUN      ] Containers.VectorPrepare
[       OK ] Containers.VectorPrepare (704 ms)
[ RUN      ] Containers.SetPrepare
[       OK ] Containers.SetPrepare (593 ms)
[ RUN      ] Containers.MultisetPrepare
[       OK ] Containers.MultisetPrepare (578 ms)
[ RUN      ] Containers.UnorderedSetPrepare
[       OK ] Containers.UnorderedSetPrepare (266 ms)
[ RUN      ] Containers.UnorderedMultisetPrepare
[       OK ] Containers.UnorderedMultisetPrepare (375 ms)
[ RUN      ] Containers.VectorSearch
[       OK ] Containers.VectorSearch (4484 ms)
[ RUN      ] Containers.SetSearch
[       OK ] Containers.SetSearch (5469 ms)
[ RUN      ] Containers.MultisetSearch
[       OK ] Containers.MultisetSearch (5485 ms)
[ RUN      ] Containers.UnorderedSet
[       OK ] Containers.UnorderedSet (1078 ms)
[ RUN      ] Containers.UnorderedMultiset
[       OK ] Containers.UnorderedMultiset (1250 ms)
[----------] 11 tests from Containers (20516 ms total)

This profiling shows that you should use the "normal" container variants instead of the "multi" ones, and that you should pick unordered_set: it wins on both build time and search time.

Here are the results for another case (probably not your app's scenario, but shown for completeness), where the number of bad searches equals the number of good searches (2 million each). The winner remains the same.

Also note that for a static dictionary a sorted vector searches faster than a set (though it takes longer to initialize), but it becomes painful if you later have to add elements.

[ RUN      ] Containers.DictionaryPrepare
[       OK ] Containers.DictionaryPrepare (235 ms)
[ RUN      ] Containers.VectorPrepare
[       OK ] Containers.VectorPrepare (718 ms)
[ RUN      ] Containers.SetPrepare
[       OK ] Containers.SetPrepare (578 ms)
[ RUN      ] Containers.MultisetPrepare
[       OK ] Containers.MultisetPrepare (579 ms)
[ RUN      ] Containers.UnorderedSetPrepare
[       OK ] Containers.UnorderedSetPrepare (265 ms)
[ RUN      ] Containers.UnorderedMultisetPrepare
[       OK ] Containers.UnorderedMultisetPrepare (375 ms)
[ RUN      ] Containers.VectorSearch
[       OK ] Containers.VectorSearch (3375 ms)
[ RUN      ] Containers.SetSearch
[       OK ] Containers.SetSearch (3656 ms)
[ RUN      ] Containers.MultisetSearch
[       OK ] Containers.MultisetSearch (3766 ms)
[ RUN      ] Containers.UnorderedSet
[       OK ] Containers.UnorderedSet (875 ms)
[ RUN      ] Containers.UnorderedMultiset
[       OK ] Containers.UnorderedMultiset (1016 ms)
[----------] 11 tests from Containers (15438 ms total)

Testing code:

// Test fixtures assumed by the cases below: the containers are global so the
// "Prepare" tests fill them and the "Search" tests reuse them; generate_string()
// is a random word generator whose definition is not shown here.
#include <algorithm>
#include <cstdlib>
#include <set>
#include <string>
#include <unordered_set>   // std::tr1::unordered_set on the MSVC 2008 setup above
#include <vector>
#include <gtest/gtest.h>

std::string generate_string();                 // random word generator (not shown)
const size_t TOTAL_ELEMENTS = 400000;          // dictionary size
const size_t TOTAL_SEARCHES = 4000000;         // good lookups (2 million in the second run)
const size_t TOTAL_BAD_SEARCHES = 0;           // bad lookups (2 million in the second run)
const std::string NONEXISTENT_ELEMENT = "#";   // any string generate_string() never produces
bool strings_initialized = false;

std::vector<std::string> strings;              // the shared dictionary of random words
std::vector<std::string> vec;
std::set<std::string> set;
std::multiset<std::string> multiset;
std::unordered_set<std::string> uo_set;
std::unordered_multiset<std::string> uo_multiset;

TEST(Containers, DictionaryPrepare) {
   EXPECT_FALSE(strings_initialized);
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      strings.push_back(generate_string());
   }
   strings_initialized = true;                 // mark the shared dictionary as built
}

TEST(Containers, VectorPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      vec.push_back(strings[i]);
   }
   sort(vec.begin(), vec.end());
}

TEST(Containers, SetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      set.insert(strings[i]);
   }
}

TEST(Containers, MultisetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      multiset.insert(strings[i]);
   }
}

TEST(Containers, UnorderedSetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      uo_set.insert(strings[i]);
   }
}

TEST(Containers, UnorderedMultisetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      uo_multiset.insert(strings[i]);
   }
}

TEST(Containers, VectorSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      std::binary_search(vec.begin(), vec.end(), strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      std::binary_search(vec.begin(), vec.end(), NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, SetSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      set.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      set.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, MultisetSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      multiset.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      multiset.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, UnorderedSet) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      uo_set.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      uo_set.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, UnorderedMultiset) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      uo_multiset.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      uo_multiset.find(NONEXISTENT_ELEMENT);
   }
}
Kotti
You shouldn't test container construction time. It's only done once, while look-ups are the important thing.
GMan
WOW! Thanks for your thoroughness.
Francisco P.
Redid my tests with the building/searching split in mind. Thank you.
Kotti
+1  A: 

I suggest using the first letter and the length of the word as the first two keys in the search, and organizing the data to support this algorithm.

First, let us define a container for all the words that are the same length:

typedef std::vector<std::string> Word_Container;

This container must be kept sorted so that a binary search can be used.

Next, an index table is created, of the form <word length, pointer to word container>:

typedef std::map<unsigned int, Word_Container *> Index_Table;

And finally have an array of index tables, using the first letter of the word as an index:

Index_Table alpha_array[26]; // ASCII A - Z.

The algorithm is then:

Calculate the index into alpha_array:

index = word[0] - 'A';

Use the index to get the associated index table:

Index_Table& table = alpha_array[index];

Use the length of the word as the key to look up the word container (use find rather than operator[], which would insert an empty entry for a missing length):

Index_Table::const_iterator it = table.find(word.length());
Word_Container* p_word_container = (it != table.end()) ? it->second : NULL;

Search the container for the exact word:

bool found = false;
if (p_word_container)
{
    found = std::binary_search(p_word_container->begin(), p_word_container->end(), word);
}

There are more efficient, but more complex, methods for searching a dictionary. The above algorithm has the benefit of quick "escape" points when the word doesn't exist. The concept can be extended to database tables.

Thomas Matthews
A: 

I would echo Ken's suggestion of using a trie, and further suggest that you could dramatically shrink its size by letting it act more like a finite-state machine that shares common prefixes and suffixes. For example,
"nation",
"national",
"nationalize",
"nationalization",
"nationalizing",
"denationalize",
could all share common structure. You would have to watch out for the dubious words such sharing can accept, like
"nationalizationalizationalizing"
I used this ages ago in a spelling correction program.

Mike Dunlavey