Hi there,

I need to match a series of user-entered words against a large dictionary of words (to ensure the entered value exists).

So if the user entered:

"orange" it should match an entry "orange' in the dictionary.

Now the catch is that the user can also enter a wildcard, or a series of wildcard characters, say

"or__ge" which would also match "orange"

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.  

If the size of the word list was small I could use a string containing all the words and use regular expressions.

However, given that the word list could potentially contain hundreds of thousands of entries, I'm assuming this wouldn't work.

So would some sort of 'tree' be the way to go for this...?

Any thoughts or suggestions on this would be totally appreciated!

Thanks in advance, Matt

+1  A: 

I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)

However, if that weren't good enough, I would probably use a prefix tree (a trie) for this.

The basic structure is a tree where:

  • The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z, assuming you are using a full dictionary...).
  • The next level down contains all the possible second letters for each given first letter.
  • And so on, until you reach an "end of word" marker for each word.

Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm: at each character position you either follow the single branch for a direct match or, in the case of a wildcard, check each of the possible branches.

In the worst case (all wildcards, but only one word with the right number of letters, right at the end of the dictionary) you would traverse the entire tree, but this is still only O(n) in the size of the dictionary, so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists, since large branches of the search tree are "pruned" with each successive letter.
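
Not the author's code, just a minimal C++ sketch of that idea, assuming lowercase a-z and '_' as the wildcard character:

    #include <memory>
    #include <string>

    // One node per letter position; children[i] covers letter 'a' + i.
    struct TrieNode {
        bool isWord = false;                     // "end of word" marker
        std::unique_ptr<TrieNode> children[26];  // a-z only, as above
    };

    void insert(TrieNode* root, const std::string& word) {
        TrieNode* node = root;
        for (char c : word) {
            int i = c - 'a';
            if (!node->children[i])
                node->children[i] = std::make_unique<TrieNode>();
            node = node->children[i].get();
        }
        node->isWord = true;
    }

    // Recursive match: a literal follows one branch, a wildcard tries all 26.
    bool matches(const TrieNode* node, const std::string& pattern, size_t pos) {
        if (pos == pattern.size()) return node->isWord;
        char c = pattern[pos];
        if (c == '_') {                          // wildcard: check every branch
            for (const auto& child : node->children)
                if (child && matches(child.get(), pattern, pos + 1)) return true;
            return false;
        }
        const TrieNode* next = node->children[c - 'a'].get();
        return next && matches(next, pattern, pos + 1);
    }

    // Usage: TrieNode root; insert(&root, "orange");
    //        matches(&root, "or__ge", 0) -> true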

mikera
+1  A: 

No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.

If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, we'll assume a Latin alphabet with 26 letters and MAX_LEN as the maximum word length.

Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].

For each word in your dictionary, add the word's index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].

Then, to retrieve the words matching "or__ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
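
A rough C++ sketch of that scheme (the value of MAX_LEN, the '_' wildcard, and the final length check are my assumptions, not part of the answer):

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>

    const size_t MAX_LEN = 32;             // assumed maximum word length
    std::set<int> table[26][MAX_LEN];      // [letter][position] -> word indexes

    void addWord(const std::string& word, int index) {
        for (size_t pos = 0; pos < word.size() && pos < MAX_LEN; ++pos)
            table[word[pos] - 'a'][pos].insert(index);
    }

    // Intersect the sets for every non-wildcard position of the pattern.
    // Note: an all-wildcard pattern needs separate handling, and the caller
    // should still filter the result by exact word length ("oranges" would
    // otherwise survive a match against "or__ge").
    std::set<int> lookup(const std::string& pattern) {
        std::set<int> result;
        bool first = true;
        for (size_t pos = 0; pos < pattern.size(); ++pos) {
            if (pattern[pos] == '_') continue;   // wildcard: no constraint
            const std::set<int>& s = table[pattern[pos] - 'a'][pos];
            if (first) { result = s; first = false; continue; }
            std::set<int> tmp;
            std::set_intersection(result.begin(), result.end(),
                                  s.begin(), s.end(),
                                  std::inserter(tmp, tmp.begin()));
            result.swap(tmp);
            if (result.empty()) break;           // early exit: no candidates left
        }
        return result;
    }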

Igor Krivokon
A: 

You can try a string-matrix:

0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13: SUPERLONGWORD

Let's call this a ragged index-matrix, to spare some memory. Order it by length, then alphabetically. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, and z is the position within it. The length of your search string is f, and g is the number of entries in the dictionary.

  • Create list m, which contains potential match indexes x.
  • Iterate on z from 0 to f.
    • Is it a wildcard and not the last character of the search string?
      • Continue loop (all match).
    • Is m empty?
      • Search through all x from 0 to g for entries whose length y matches. !!A!!
        • Does the character at position z match the search string at that z? Save x in m.
      • Is m empty? Break loop (no match).
    • Is m not empty?
      • Search through all elements of m. !!B!!
        • Does it not match the search string? Remove it from m.
      • Is m empty? Break loop (no match).

A wildcard will always pass the "match with search string?" test, and m keeps the same ordering as the matrix.

!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)

The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), and it also gives you the offset of each entry (assuming the other fields are constant), so you can easily find the next entry in the matrix for fast iteration. Ordering the matrix isn't a problem: it only has to be done when the dictionary updates, not at search time.
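
A simplified C++ sketch of the core idea (length bucketing plus per-position narrowing of m); note it filters candidates linearly rather than with the binary searches at !!A!! and !!B!!, which is my simplification:

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    std::map<size_t, std::vector<std::string>> byLength;  // length -> sorted words

    void addWord(const std::string& w) {
        auto& bucket = byLength[w.size()];
        bucket.insert(std::upper_bound(bucket.begin(), bucket.end(), w), w);
    }

    std::vector<std::string> lookup(const std::string& pattern) {
        auto it = byLength.find(pattern.size());  // match on length first
        if (it == byLength.end()) return {};
        std::vector<std::string> m = it->second;  // candidate list, still ordered
        for (size_t z = 0; z < pattern.size(); ++z) {
            if (pattern[z] == '_') continue;      // wildcard: all candidates match
            std::vector<std::string> kept;
            for (const auto& w : m)
                if (w[z] == pattern[z]) kept.push_back(w);
            m.swap(kept);
            if (m.empty()) break;                 // no match
        }
        return m;
    }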

Pindatjuh
A: 

If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.

You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.

Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.

If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.

If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search within that bound. If the wildcards are all trailing, then finding a non-empty range is sufficient.
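
A C++ sketch of that bound-then-scan step, assuming a bucket of same-length sorted words and '_' as the wildcard:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Binary-search the range of words starting with the literal prefix
    // before the first wildcard, then scan that range against the pattern.
    std::vector<std::string> search(const std::vector<std::string>& sorted,
                                    const std::string& pattern) {
        std::string prefix = pattern.substr(0, pattern.find('_'));
        std::string hiKey = prefix;
        if (!hiKey.empty()) ++hiKey.back();  // "or" -> "os": just past the prefix
        auto lo = std::lower_bound(sorted.begin(), sorted.end(), prefix);
        auto hi = hiKey.empty() ? sorted.end()
                                : std::lower_bound(sorted.begin(), sorted.end(), hiKey);
        std::vector<std::string> out;
        for (auto it = lo; it != hi; ++it) {             // linear scan in the bound
            if (it->size() != pattern.size()) continue;  // bucket guard
            bool ok = true;
            for (size_t i = 0; i < pattern.size(); ++i)
                if (pattern[i] != '_' && (*it)[i] != pattern[i]) { ok = false; break; }
            if (ok) out.push_back(*it);
        }
        return out;
    }

With no wildcard, the prefix is the whole word and the range collapses to the exact match; with a leading wildcard, the prefix is empty and this degenerates to the linear scan described in the next paragraph.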

If the search term starts with wildcards, then the sorted array is no help and you would need to do a linear search, unless you keep a copy of the array sorted by reversed strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards, then there is no need.

If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.

So: an array of arrays of strings. Each array of strings is sorted and contains strings of equal length. Optionally, duplicate the whole structure with the sorting based on reversed strings for the case of leading wildcards.

The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.

For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.

drawnonward
"If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length." Check out my answer: I skip the wildcards only if it's not the latest in the search string (in case of a full wildcards only search, which is linear), which forces it to make use of the binary search, no matter if it's wildcarded.
Pindatjuh
+6  A: 
Norman Ramsey
Since I don't have access to the publications, I'm wondering one thing: do they build one DAWG for each different length or not? I think it could considerably speed up the search, since in this case we know beforehand how many letters the word we seek has.
Matthieu M.
@Matthieu: Google will get you the paper, but I've also added a (possibly ephemeral) link. As for one DAWG per length, you can do this, but it's a time-space tradeoff. The DAWG will store a long word list very effectively with lots of sharing. With one DAWG per length you will lose that sharing. As for speedup it's an experimental question, and experiments may come out differently depending on the machine's cache.
Norman Ramsey
A: 

Try building a generalized suffix tree if the dictionary will be matched by a sequence of queries. There is a linear-time algorithm that can be used to build such a tree (Ukkonen's suffix tree construction).

You can easily match each query (in O(k), where k is the size of the query) by traversing from the root node, using the wildcard character to match any character, as in typical pattern matching in a suffix tree.
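
For illustration only, here is a naive generalized suffix trie in C++ (quadratic to build, not Ukkonen's linear-time compressed tree) showing the O(k) traversal with wildcard branching; the '_' wildcard and the substring semantics are my assumptions:

    #include <map>
    #include <memory>
    #include <set>
    #include <string>

    // Each node records the ids of words whose suffixes pass through it,
    // so a match anywhere in a word can report that word. A real suffix
    // tree compresses chains of nodes and is far more memory-efficient.
    struct Node {
        std::map<char, std::unique_ptr<Node>> next;
        std::set<int> wordIds;
    };

    void addWord(Node* root, const std::string& w, int id) {
        for (size_t start = 0; start < w.size(); ++start) {  // every suffix
            Node* n = root;
            for (size_t i = start; i < w.size(); ++i) {
                auto& child = n->next[w[i]];
                if (!child) child = std::make_unique<Node>();
                n = child.get();
                n->wordIds.insert(id);
            }
        }
    }

    // O(k) along a literal path; a wildcard branches into every child.
    void match(const Node* n, const std::string& q, size_t pos, std::set<int>& out) {
        if (pos == q.size()) {
            out.insert(n->wordIds.begin(), n->wordIds.end());
            return;
        }
        if (q[pos] == '_') {
            for (const auto& kv : n->next) match(kv.second.get(), q, pos + 1, out);
        } else {
            auto it = n->next.find(q[pos]);
            if (it != n->next.end()) match(it->second.get(), q, pos + 1, out);
        }
    }

This returns words containing the pattern as a substring; for the whole-word matching in the question you would add end-of-word terminators or check lengths.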

monn