views:

90

answers:

5

The Problem: A large static list of strings is provided, along with a pattern string comprised of literal data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.

Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
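
In sketch form, the baseline looks roughly like this (assuming a Python/fnmatch-style glob; this is illustrative, not the exact code):

from fnmatch import fnmatchcase

def linear_search(strings, pattern):
    # O(n) baseline: glob every entry in the list against the pattern.
    return [s for s in strings if fnmatchcase(s, pattern)]

# linear_search(['hello', 'world'], '*or*')  ->  ['world']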

My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?

Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required to evaluate a match by merging the lists of words returned against the pattern is a nightmare; furthermore, I'm not convinced it's the correct approach.

A: 

If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:

[('d'    , 'world'),
 ('ello' , 'hello'),
 ('hello', 'hello'),
 ('ld'   , 'world'),
 ('llo'  , 'hello'),
 ('lo'   , 'hello'),
 ('o'    , 'hello'),
 ('orld' , 'world'),
 ('rld'  , 'world'),
 ('world', 'world')]

Use this array to build sets of candidate matches using pieces of the pattern.

For instance, if the pattern is *or*, find the candidate match ('orld' , 'world') using a binary chop on the substring or, then confirm the match using a normal globbing approach.

If the wildcard is more complex, e.g., h*o, build sets of candidates for h and o and find their intersection before the final linear glob.
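
Here is a rough sketch of that idea (illustrative only; the use of bisect for the binary chop and fnmatch for the final confirmation are my assumptions, not part of the original answer):

import bisect
from fnmatch import fnmatchcase

def build_suffix_array(words):
    # Every suffix of every word, sorted, each pointing back to its word.
    return sorted((w[i:], w) for w in words for i in range(len(w)))

def candidates(entries, suffixes, piece):
    # All words that have some suffix starting with `piece` (binary chop).
    # The '\uffff' sentinel assumes no U+FFFF characters appear in the data.
    lo = bisect.bisect_left(suffixes, piece)
    hi = bisect.bisect_right(suffixes, piece + '\uffff')
    return {w for _, w in entries[lo:hi]}

def search(entries, suffixes, pattern):
    # Intersect candidate sets for each literal piece, then glob to confirm.
    pieces = [p for p in pattern.replace('?', '*').split('*') if p]
    cands = None
    for piece in pieces:
        cs = candidates(entries, suffixes, piece)
        cands = cs if cands is None else cands & cs
    if cands is None:                    # the pattern was all wildcards
        cands = {w for _, w in entries}
    return [w for w in cands if fnmatchcase(w, pattern)]

entries = build_suffix_array(['hello', 'world'])
suffixes = [s for s, _ in entries]
search(entries, suffixes, '*or*')        # -> ['world']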

Marcelo Cantos
@Marcelo: Let's assume I were to concatenate all the strings in the static list into one string of length n. You're saying I should then consume n*(n-1) space (not counting the overhead of each list element) to do what exactly? I still have to figure out how to match a * sequence over possible strings like "aa", "aba", "aaba", "aaaba" against a given pattern such as "a*a" - not a viable solution even if memory were not an issue. Furthermore, the patterns will be arbitrary; I can't be building sets of suffixes to accommodate every possible pattern.
Monomer
You don't store every conceivable suffix; you take the suffixes from your static list.
Marcelo Cantos
A: 

You could build a regular trie and add wildcard edges. Then your complexity would be O(n), where n is the length of the pattern. You would have to replace runs of ** with * in the pattern first (also an O(n) operation).

If the list of words were I am an ox then the trie would look a bit like this:

  (I ($ [I])
   a (m ($ [am])
      n ($ [an])
      ? ($ [am an])
      * ($ [am an]))
   o (x ($ [ox])
      ? ($ [ox])
      * ($ [ox]))
   ? ($ [I]
      m ($ [am])
      n ($ [an])
      x ($ [ox])
      ? ($ [am an ox])
      * ($ [I am an ox]
         m ($ [am]) ...)
   * ($ [I am an ox]
      I ...
    ...

And here is a sample Python program:

import sys

def addWord(root, word):
    add(root, word, word, '')

def add(root, word, tail, prev):
    # Add `word` to the trie, branching on the literal character, on '?',
    # and (unless we just followed a '*') on '*' for every possible skip length.
    if tail == '':
        addLeaf(root, word)
    else:
        head = tail[0]
        tail2 = tail[1:]
        add(addEdge(root, head), word, tail2, head)
        add(addEdge(root, '?'), word, tail2, head)
    if prev != '*':
        for l in range(len(tail) + 1):
            add(addEdge(root, '*'), word, tail[l:], '*')

def addEdge(root, char):
    # Get or create the child node reached by `char`.
    if char not in root:
        root[char] = {}
    return root[char]

def addLeaf(root, word):
    # '$' marks an accepting node; it holds the words that end here.
    if '$' not in root:
        root['$'] = []
    leaf = root['$']
    if word not in leaf:
        leaf.append(word)

def findWord(root, pattern):
    # Walk the trie edge by edge, collapsing runs of '*' in the pattern as we go.
    prev = ''
    for p in pattern:
        if p == '*' and prev == '*':
            continue
        prev = p
        if p not in root:
            return []
        root = root[p]
    return root.get('$', [])

def run():
    print("Enter words, one per line; terminate with a . on a line")
    root = {}
    while True:
        line = sys.stdin.readline()[:-1]
        if line == '.':
            break
        addWord(root, line)
    print(repr(root))
    print("Now enter search patterns. Do not use multiple sequential '*'s")
    while True:
        line = sys.stdin.readline()[:-1]
        if line == '.':
            break
        print(findWord(root, line))

if __name__ == '__main__':
    run()
David Leonard
@David: what is a wildcard edge? could you please elaborate.
Monomer
@Monomer a wildcard character from the pattern. The idea being you build a tree that answers all valid patterns.
David Leonard
+4  A: 

I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make its construction use up just as much time as its usage would save. They're best if you've got to query them multiple times to amortize the construction cost - perhaps a few hundred queries.

Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.
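
As a sketch of what that could look like (my own illustration, again assuming a Python/fnmatch-based matcher; not code from the answer):

from fnmatch import fnmatchcase
from functools import partial
from multiprocessing import Pool

def match_chunk(pattern, chunk):
    # Plain linear glob over one slice of the list.
    return [s for s in chunk if fnmatchcase(s, pattern)]

def parallel_search(strings, pattern, workers=2):
    # Split the list into `workers` roughly equal chunks and scan them in parallel.
    # Call this from under `if __name__ == '__main__':` on spawn-based platforms.
    size = max(1, (len(strings) + workers - 1) // workers)
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    with Pool(workers) as pool:
        results = pool.map(partial(match_chunk, pattern), chunks)
    return [s for part in results for s in part]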

Karl
@Karl: I have to agree with you on the trie problem. It does get quite large when constructed from the list at hand. I've thought about using parallelism; however, that doesn't give me much when considering the number of times the search is to occur per second. I was hoping there would be a hashtable-like data structure that just magically brings everything down to O(1) or some such....
Monomer
Unfortunately you can't have O(1) and wildcards. Maybe if the problem were rephrased to work on the word level instead of character-wise you could cut the search space.
Karl
+3  A: 

You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? E.g., is blah* much more common than bl?h (which I'd assume it is) among your current users?

With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
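
For instance, if prefix patterns such as blah* did dominate, a pre-sorted copy of the list plus a binary search would make that common case cheap while everything else falls back to the linear scan (a sketch under that assumption, not from the original answer; big_list is a placeholder name):

import bisect
from fnmatch import fnmatchcase

def search(sorted_strings, pattern):
    # Fast path: "<literal prefix>*" patterns need only two binary searches
    # on a pre-sorted copy of the list.
    if pattern.endswith('*') and not any(c in '*?' for c in pattern[:-1]):
        prefix = pattern[:-1]
        lo = bisect.bisect_left(sorted_strings, prefix)
        hi = bisect.bisect_right(sorted_strings, prefix + '\uffff')  # assumes no U+FFFF in the data
        return sorted_strings[lo:hi]
    # Slow path: fall back to the existing linear glob.
    return [s for s in sorted_strings if fnmatchcase(s, pattern)]

# words = sorted(big_list)   # built once, up front
# search(words, 'blah*')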

Daniel Earwicker
@Daniel: I've tried gathering statistics on the types of patterns being queried, and there aren't any obvious winners. One thing that I did not mention in the original question is that the strings in the static list have a maximum size, and that the average is roughly half the maximum with a stdev of about 1/4 of the average. Not sure if that provides any insight into the problem.
Monomer
So you wouldn't even say that using one wildcard is a lot more common than using five wildcards?
Daniel Earwicker
A: 

You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no bs or a single b can never match the query abba*, so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.

If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.

In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
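
A minimal sketch of that character-count filter (my own illustration of the idea, using collections.Counter for the index and fnmatch for the final confirmation; big_list is a placeholder name):

from collections import Counter
from fnmatch import fnmatchcase

def build_index(strings):
    # Pre-compute a character count for every string, once.
    return [(s, Counter(s)) for s in strings]

def search(index, pattern):
    # Counts of the literal (non-wildcard) characters the pattern requires.
    required = Counter(c for c in pattern if c not in '*?')
    results = []
    for s, counts in index:
        # Discard strings that cannot possibly contain the required characters.
        if any(counts[c] < n for c, n in required.items()):
            continue
        if fnmatchcase(s, pattern):
            results.append(s)
    return results

# index = build_index(big_list)
# search(index, 'abba*')   # skips any string with fewer than two a's or fewer than two b's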

tucuxi