I have a large database for solving crossword puzzles, consisting of words and their descriptions. My application allows searching for words of a specific length with characters at specific positions (currently done the hard way: go through all the words and check each one), plus a search by description if necessary.

For instance, find the word _ _ A _ _ B (a 6-letter word with the third character A and the last character B).

I would like to index the words in such a way that searching is really fast. My first idea was to use a balanced tree structure. Any other suggestions?

+2  A: 

Since you use a database, create a Suffixes table.
For example:

  Suffix          |   WordID   | SN
  ----------------+------------+----
  StackOverflow   |     10     |  1
  tackOverflow    |     10     |  2
  ackOverflow     |     10     |  3
  ckOverflow      |     10     |  4
  kOverflow       |     10     |  5
  ...

With that table it's easy to get all words that contain a particular char at a specific position, like this:

SELECT WordID FROM suffixes
WHERE suffix >= 't' AND suffix < 'u' AND SN = 2

This returns all words that have 't' at position 2.

Update: if you want to save space, and sacrifice a bit of speed, you can use a suffix array.

You can store all the words in one long string (array) with a separator between them, e.g. $, and build a suffix array holding pointers into that string. Now, given a char c, you can find all the words that contain it rather quickly. You'll still have to check whether it's at the right position (by checking how far it is from the surrounding $ separators).

With the above technique the search will probably be around 10x faster than scanning all the words as your original program does.
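
A rough sketch of that idea in Python, assuming a $ separator as above and a naive construction for brevity (real implementations build the suffix array much more efficiently); the word list is illustrative:

import bisect

words = ['scarab', 'superb', 'absorb']        # illustrative word list
text = '$' + '$'.join(words) + '$'

# Naive suffix array: offsets into text, sorted by the suffix starting there.
sa = sorted(range(len(text)), key=lambda i: text[i:])
first_chars = [text[i] for i in sa]           # first char of each suffix is enough here

def words_with_char_at(c, pos):
    """All words whose character at 0-based position pos is c."""
    lo = bisect.bisect_left(first_chars, c)
    hi = bisect.bisect_right(first_chars, c)
    hits = set()
    for k in range(lo, hi):
        i = sa[k]
        start = text.rfind('$', 0, i) + 1     # distance from the preceding $
        if i - start == pos:
            hits.add(text[start:text.find('$', i)])
    return hits

print(words_with_char_at('a', 2))             # -> {'scarab'}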

Update 2: I originally used the database approach in one of my utilities where I needed to locate suffixes such as "ne", and I forgot to adjust (optimize) it for this specific problem.

You can just store a single char as a suffix:

  Suffix   |   WordID   | SN
  ---------+------------+----
  S        |     10     |  1
  t        |     10     |  2
  a        |     10     |  3
  c        |     10     |  4
  k        |     10     |  5
  ...

which saves a lot of space. Now, the query becomes

SELECT WordID FROM suffixes
WHERE suffix = 't' AND SN = 2
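
A minimal sketch of this single-char design, assuming the in-memory SQLite database suggested in the comments below; the words table, the index, and the sample data are illustrative additions:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)')
conn.execute('CREATE TABLE suffixes (suffix TEXT, wordid INTEGER, sn INTEGER)')
conn.execute('CREATE INDEX suffix_sn ON suffixes (suffix, sn)')

for wordid, word in enumerate(['scarab', 'superb', 'absorb'], start=1):
    conn.execute('INSERT INTO words VALUES (?, ?)', (wordid, word))
    for sn, ch in enumerate(word, start=1):
        conn.execute('INSERT INTO suffixes VALUES (?, ?, ?)', (ch, wordid, sn))

# _ _ A _ _ B: 6-letter words with 'a' at position 3 and 'b' at position 6.
rows = conn.execute("""
    SELECT w.word FROM words w
    JOIN suffixes s1 ON s1.wordid = w.id AND s1.suffix = 'a' AND s1.sn = 3
    JOIN suffixes s2 ON s2.wordid = w.id AND s2.suffix = 'b' AND s2.sn = 6
    WHERE length(w.word) = 6
""").fetchall()
print(rows)                                   # -> [('scarab',)]
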
Nick D
Isn't this a *suffix* table?
KennyTM
@KennyTM, yes suffix, that's what I said :-)
Nick D
Would this require adding all the suffixes to the database? If so, the number of database entries would go way up, by a factor of 10-15, and the search would then probably be slower. Any suggestions on how to store and search such data, in addition to the database solution?
Drejc
@Drejc, yes, the suffixes will be stored in the database. IMO the factor will be 6-7. The `Suffix` column should be a key, so it'll have an index (even more space :), and in that case the search will be fast: O(log n), if I'm not mistaken.
Nick D
@Drejc, if you want to store the data in memory AND still exploit the power and ease of a query language, I recommend SQLite, which has the option to create an in-memory database.
Nick D
This is probably the way to go. Currently I'm looking at a set of around 150.000 words * 7 = 1.500.000 (but this will probably go up).
Drejc
@Drejc, 150.000 words * 7 = 1.050.000 :) Anyway, see my update.
Nick D
Oops, typo ... yes, 1.050.000 ;) I will certainly give it a try and test out both approaches.
Drejc
@Drejc, check out my second update.
Nick D
It's no longer a suffix :)
Matthieu M.
@Matthieu, yeah :) what's the proper English word, an infix?
Nick D
@Nick: being French... I am certainly not the one you ought to ask about vocabulary issues. I have enough difficulties as it is to make myself understood ;)
Matthieu M.
+1  A: 

You can use a Suffix Tree or a Trie.

Skurmedel
This is probably the correct way to go. Any suggestions on how to store this data in order to get a fast search? It doesn't necessarily have to be a database - it can be a flat file or similar.
Drejc
I came to the conclusion that your problem is a much harder problem than my small brain can solve easily :) The Suffix Tree works with individual strings, so I'm a bit stumped when it comes to using it for searching multiple words. It will certainly be very quick when searching an individual string for substrings, but I don't quite see how to apply it to all words the way Nick's solution does. So maybe my answer is a non-answer; hopefully it helps somewhat.
Skurmedel
+2  A: 

This question, "Good algorithm and data structure for looking up words with missing letters?", started out exactly like the one you're asking, but then it was edited into something rather different and easier. Still, you can find some ideas there.

In short, everyone recommends loading the whole dictionary into memory and dividing the words into groups based on their length. From there, you can go many different directions. The more memory you are willing to use up, the faster you can go.

One nice suggestion is to keep a hash table of lists of words of a given length that have a given letter in a given position. You can build it like this (in Python):

import collections

# Build a whole lot of sorted word lists, keyed by (length, position, letter).
# all_words is assumed to be your full word list.
wordlists = collections.defaultdict(list)
for word in sorted(all_words):
    for position, letter in enumerate(word):
        wordlists[len(word), position, letter].append(word)

Now if you need a 6-letter word ending in B, you can just ask for wordlists[6, 5, 'B'] and you've got the complete list. When you know more than one letter, as in ..A..B, you can pick whichever list is shortest and test each word against the desired pattern. My computer's dictionary only has 21 six-letter words ending with B, of which only SCARAB matches.
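
A sketch of that lookup step, using the wordlists dict built above ('.' marks an unknown letter; the dictionary words are assumed to be uppercase, as in the example):

pattern = '..A..B'
known = [(i, ch) for i, ch in enumerate(pattern) if ch != '.']

# Pick the shortest candidate list, then test each word against the pattern.
candidates = min((wordlists[len(pattern), i, ch] for i, ch in known), key=len)
matches = [w for w in candidates if all(w[i] == ch for i, ch in known)]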

Jason Orendorff
Reading the linked question/answers, a regex might also provide a solution, provided the words are separated by length in advance. It's hard to tell which is the right way to go without some testing and measuring.
Drejc
+4  A: 

Okay, I am going to propose something weird, but coming from C++ I have been using Boost for a long time, and there I came to know the MultiIndex library.

The idea of this library is to create one collection, but have many different ways to query it. It could model, in fact, a database.

So, let's put our words in a table, and put the necessary indexes in place:

word                     |length|c0|c1|c2| ... |c26|
-------------------------|------|--|--|--| ... |---|
Singapour                |9     |S |i |n | ... |0  |

Now the query will look like:

Select word From table Where length=9 And c2='n' And c8='r';

Easy enough, isn't it?

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to the partition.

For an in-memory solution, you would have one container per length, holding as many indexes as the length, each index being a hash table pointing to a sorted list (for easier merging).

Here is a Python description:

import collections

class Dictionary:
  def __init__(self, length):
    self.length = length
    self.words = set()
    self.indexes = collections.defaultdict(set)

  def add(self, word):
    if len(word) != self.length:
      raise ValueError(word + ' is not ' + str(self.length) + ' characters long')

    if word in self.words:
      raise ValueError(word + ' is already in the dictionary')

    self.words.add(word)

    for i in range(self.length):
      self.indexes[(i, word[i])].add(word)

  def search(self, constraints):
    """constraints: list of tuples (position, character)"""
    # Intersect the smallest index sets first to keep intermediate results small.
    sets = sorted((self.indexes[c] for c in constraints), key=len)
    return set.intersection(*sets) if sets else set(self.words)

I have deliberately provided the length argument, to minimize the size of the hash tables and thus make the search faster. Also, the sets are sorted by size so that the intersection is computed more efficiently :)
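
For instance, with one Dictionary per word length (a hypothetical three-word dictionary):

d = Dictionary(6)
for w in ('scarab', 'superb', 'absorb'):
  d.add(w)

print(d.search([(2, 'a'), (5, 'b')]))   # _ _ a _ _ b  ->  {'scarab'}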

Go ahead and test it against other solutions if you wish :)

Matthieu M.
+1  A: 

You could store your information in a trie of some sort (perhaps a ternary search tree). An algorithm for partial search using a trie is described in section 6 of this paper by Sedgewick and Bentley. You will of course want different tries for the various word lengths. The paper says that the partial search algorithm requires O(n^((k-s)/k)) time when s of the k letters are specified, in a trie of n words of length k.
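
A hedged sketch of that partial search on a plain dict-based trie (not the ternary search tree from the paper), with '.' standing for an unspecified letter:

def make_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                          # sentinel marking a complete word
    return root

def partial_match(node, pattern, depth=0):
    if depth == len(pattern):
        if '$' in node:
            yield node['$']
        return
    ch = pattern[depth]
    # On '.', branch into every child; otherwise follow the single matching edge.
    children = node.items() if ch == '.' else ([(ch, node[ch])] if ch in node else [])
    for key, child in children:
        if key != '$':
            yield from partial_match(child, pattern, depth + 1)

trie = make_trie(['scarab', 'superb', 'absorb'])
print(list(partial_match(trie, '..a..b')))     # -> ['scarab']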

Justin Peel