ansaurus

Question

Algorithm for matching partially filled words

Answer 1

A:

test = '--a-';

for each (words as word)
{
    if ((word.length == test.length)
        && (test.index(0) == '-' || (word.index(0) == test.index(0)))
        && (test.index(1) == '-' || (word.index(1) == test.index(1)))
        && (test.index(2) == '-' || (word.index(2) == test.index(2)))
        && (test.index(3) == '-' || (word.index(3) == test.index(3))))
    {
        // match
    }
}

Is that what you need? Obviously it needs modified a little to work for different lengths.

Coronatus 2010-04-02 06:49:13

@Coronatus, checking in this manner will be inefficient. The dictionary that is going to be used is pretty big.

2010-04-02 06:52:25

Not really. First it filters the words to ones of the correct length (if different lengths, the other conditions are not checked). Besides, I don't see how the algorithm could be done differently.

Coronatus 2010-04-02 06:56:26

Answer 2

A:

Hi, From what I understood,can't you use a regex query? in the example above the pattern is something like ??a?

Then you need to loop through all the words and check if there is a match.

SysAdmin 2010-04-02 06:59:21

Indeed this would probably work. Though you would need something better than `?`, probably `[a-zA-Z]`, because you don't want to match anything, just a letter. The only remaining question is the speed of the operation.

Matthieu M. 2010-04-02 07:03:00

Answer 3

+3 A:

Well, it does not already exists, but it's been researched on SO already, for crosswords problems.

The gist of the solution I proposed was to index by letter and indexes, which is Python gives:

class Index:
  def __init__(self, list):
    self.data = defaultdict(set)
    for word in list: self.add(word)

  def add(self, word):
    for l in range(0, len(word)):
      self.data[(l, word[l])].insert(word)

  def look(self, letters):
    """letters is a list of tuples (position, letter)"""

    result = None
    for (p,l) in letters:
      set = self.data[(p,l)]
      if result == None: result = set
      else: result = result.insersection(set)

    return result

The idea is simple: you have a large index which has a set of words for each couple (position,letter). It could be extended, in your case, to have one index per word length, which would dimish the size of the sets of words and thus be faster.

For retrieval, you simply intersect all the sets to have the common set of word that matches all the known letters.

Matthieu M. 2010-04-02 07:01:35

That seems like a good idea..Thanks!

2010-04-02 07:09:51

Intersecting two result sets is nice in theory, and the language you're using might actually make it easy to do, but that doesn't mean it's effective. Unless the result sets are represented as some bit-masks that can be AND-ed, intersecting them would be a linear operation: you look at every item in set 1, see if it's also found in set 2, if it is move it to the result set. That's INEFICIENT! Since you're looking at every single result in the first result set, why not test directly if it conforms to the required mask and skip the second set?

Cosmin Prund 2010-04-02 08:55:25

Not a bad idea--easy to implement, and possibly fast enough. It's a lot more efficient at finding very many words given few constraints than all words that have just one letter missing.

Rex Kerr 2010-04-02 13:57:36

Python is quite good about sets :) First of all, sets are implemented using hashing, so retrieval is `O(1)`. Second, if you need performance, you can just iterate over the less populated set. For simplicity I merged in place here, but in the post I referred to about scrabble I listed all the sets, then sorted them by size and merged them iterating over the smaller one (getting smaller and smaller) which is more efficient of course. Here is the original algorithm: http://stackoverflow.com/questions/2288901/best-data-structure-for-crossword-puzzle-search/2296976#2296976

Matthieu M. 2010-04-02 17:30:01

Answer 4

+1 A:

another solution could be to structure your dictionary as a prefix tree. Then your algorithm will just have to go trough that tree. For each node you know which letter is associated and the position in the word so you know if it matches the letter you are looking for. If it doesn't you stop and don't go through its children. You also know when you go over the length of your query. Each leaf you reach can be added to the results list.

This solution may be quite efficient as far as the memory consumption is concerned.

PierrOz 2010-04-02 11:40:18

Answer 5

A:

If you're running on a reasonably powerful computer (compared to the load), then PierrOz has a good answer: store the dictionary as a prefix tree. Then you can do a breadth-first search, pruning only when you reach a level where you actually know the letter.

If you need an even faster solution, you need a way of constraining your search depth. One possibility is to bin the answers. For example, you can start by grouping the words by length; then you only have to look through lists of words of a certain length. Then you could subgroup by words containing specific letters--all letter pairs is probably enough. This gives you an array of something like 13000 elements that you can rapidly index into: count the number of letters in your word, then pick the rarest letter or two in the word and use that to index into a mini-prefix-tree that only holds words of that length with those letter(s). This strategy should get you down to a couple hundred words per bin in most cases, and the prefix tree search should be fast even if you pick out most of the breadth of the tree.

Rex Kerr 2010-04-02 13:53:48

ansaurus

tags:

views:

answers:

Algorithm for matching partially filled words

related questions