views:

357

answers:

2

Hi

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking to this, but search performance is not great on this data volume (around 5 minutes).

I basically have two questions: (1) Can you suggest any ways to improve performance to avoid a table scan (i.e. having to count common q-grams between the search string and 3 million surnames) (2) With q-grams, if A is similar to B and C is similar to B, does it imply C is similar to A?

Kind regards

Peter

+1  A: 

I've been looking into fuzzy string matching lately, so even at the risk of answering to an abandoned question, here goes. Hope you find this useful.

I suppose you're only interested in the strings for which the edit distance is smaller than a given value. And your q-grams (or n-grams) look like this

2-grams for "foobar": {"fo","oo","ob","ba","ar"}
  1. You could use positional q-grams:

    "foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
    

    The positional information can be used to determine if a matching q-gram is really a "good match".

    For example, if you're searching for "foobar" with maximum edit distance of 2, this means that you're only interested in words where

    2-gram "fo" exists in with position from 1 to 3 or
    2-gram "oo" exists in with position from 2 to 4 or
    ... and so on
    

    String "barfoo" doesn't get any matches because the positions of the otherwise matching 2-grams differ by 3.

  2. Also, it might be useful to use the relation between edit distance and the count of matching q-grams. The intution is that since

    a string s has len(s)-q+1 q-grams

    and

    a single edit operation can affect at most q q-grams,

    we can deduce that

    strings s1 and s2 within edit distance of d have at least max(len(s1),len(s2))-q+1-qk matching non-positional q-grams.

    If you're searching for "foobar" with an maximum edit distance of 2, a matching 7-character string (such as "fotocar") should contain at least two common 2-grams.

  3. Finally, the obvious thing to do is to filter by lenght. The edit distance between two strings is at least the difference of the lengths of the strings. For example if your threshold is 2 and you search for "foobar", "foobarbar" cannot obviously match.

See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.

Ville Koskinen
A: 

Interesting paper about indexing DNA q-grams so you don't have to scan the entire table:

www.comp.nus.edu.sg/~atung/publication/qgram_edit.pdf

234523458