views: 169

answers: 4
Say you've got a big table that contains a varchar column.

How would you match rows that contain the word 'preferred' in the varchar column, given that the data is somewhat noisy and contains occasional spelling errors, e.g.:

['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Peferred Shares',
'Series A Perferred Shares',
'Series A Prefered Stock',
'Series A Preffered Stock',
'Perfered',
'Preffered  C']

The variations of the word 'preferred' in the spelling errors above exhibit a family resemblance, but there's very little that they all have in common. Note that splitting out every word and running Levenshtein on every word in every row is going to be prohibitively expensive.

UPDATE:

There are a couple of other examples like this, e.g. with 'restricted':

['Resticted Stock Plan',
'resticted securities',
'Ristricted Common Stock',
'Common stock (restrticted, subject to vesting)',
'Common Stock (Retricted)',
'Restircted Stock Award',
'Restriced Common Stock',]
+1  A: 

Could you try training it on a small sample of your table to find possible misspellings (using split + Levenshtein) and then use the resulting word list on the full table?
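
For example, a rough sketch of that idea in Python (the sample rows, distance cutoff, and helper names here are illustrative assumptions, not from the question):

import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def misspellings(sample_rows, target, max_dist=2):
    # Collect every word in the sample within max_dist edits of target.
    found = set()
    for row in sample_rows:
        for word in re.findall(r'[a-z]+', row.lower()):
            if levenshtein(word, target) <= max_dist:
                found.add(word)
    return found

sample = ['Class A Preffered', 'Series A Peferred Shares']
print(misspellings(sample, 'preferred'))
# e.g. {'preffered', 'peferred'}

The resulting word list can then be turned into LIKE patterns or exact word matches against the full table, so the expensive distance computation only runs on the sample.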

Simon Nickerson
+1  A: 

Are you trying to do this in T-SQL, or in some other language?

You might be able to hit the majority of them with a regular expression.

Some variation of the following:

"p(er|re|e)f{1,2}er{1,2}ed"

"r(e|i)s?t(ri|ir|rti|i)ct?ed"

You want to make sure it is not case-sensitive.
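
As a quick sanity check, here is a sketch in Python of both patterns applied with case-insensitive matching (in MySQL you would put them in a REGEXP clause instead):

import re

preferred  = re.compile(r"p(er|re|e)f{1,2}er{1,2}ed", re.IGNORECASE)
restricted = re.compile(r"r(e|i)s?t(ri|ir|rti|i)ct?ed", re.IGNORECASE)

samples = ['Class A Preffered', 'Series A Perferred Shares',
           'Ristricted Common Stock', 'Restriced Common Stock']
for s in samples:
    print(s, bool(preferred.search(s) or restricted.search(s)))
# all four print True

One caveat: as written, the first pattern misses '5.95% Preferres Stock', where the error is in the suffix rather than the stem; loosening the ending (e.g. e(d|s)) would catch it.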

J.13.L
It's MySQL, whose regex support leaves much to be desired, but it should be able to handle something like this.
ʞɔıu
Any thoughts along these lines for 'restricted'? That may be all I need.
ʞɔıu
"r(e|i)s?t(ri|ir|rti|i)ct?ed"you want to make sure that is not caps sensitive...
J.13.L
That works for all the examples you gave... You may need to adjust it a bit; let me know if you need help.
J.13.L
+2  A: 

Create two more tables, one of spellings and one of possible spellings:

-- types are illustrative; adjust to taste
create table spelling (
  id int auto_increment primary key,
  word varchar(100)
);
create table possible_spelling (
  id int auto_increment primary key,
  spelling_id int references spelling(id),
  spelling varchar(100)
);
-- possible_spelling also includes the correct spelling;
-- all values are lowercase

insert into spelling( word ) values ('preferred');
insert into possible_spelling( spelling_id, spelling )
  select 1, '%preferred%' union select 1, '%prefered%' union ...;

select * 
from bigtable a 
join possible_spelling b
on (lower(a.data) like b.spelling )
join spelling c on (b.spelling_id = c.id) 
where c.word = 'preferred';

Objection: this will be slow and requires setup. Answer: not that slow, and this should be a one-time thing to categorize and fix your data. One-time setup, then one lookup per incoming row to categorize.

tpdi
+1  A: 

I might do something like this, if you can get away with running the correction once: here is an amazing spellchecker implementation by Peter Norvig:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word frequencies; unseen words get a default count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Strings two edits away that actually appear in the corpus.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

He makes a training set available at http://norvig.com/big.txt. Here is sample output:

>>> correct('prefferred')
'preferred'
>>> correct('ristricted')
'restricted'
>>> correct('ristrickted')
'restricted'

In your case, you could copy the original column to a new one, but pass it through the spellchecker when you do. Then put a full-text index on the correctly spelled column, match your queries against it, and return results from the original column. You only have to do this once, as opposed to calculating distances on every query. You might spellcheck the query input too, or check the corrected version only as a fallback. Either way, the Norvig example is worth studying.
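
A minimal sketch of that one-time pass, assuming a DB-API connection and hypothetical table and column names (bigtable, data, data_corrected); correct() is the function defined above:

import re
import MySQLdb  # or any DB-API driver

def correct_text(text):
    # Spell-correct each alphabetic token (lowercased), leaving
    # punctuation and digits untouched.
    return re.sub(r'[A-Za-z]+', lambda m: correct(m.group().lower()), text)

conn = MySQLdb.connect(db='mydb', user='me', passwd='...')  # placeholders
cur = conn.cursor()
cur.execute("SELECT id, data FROM bigtable")
for row_id, data in cur.fetchall():
    cur.execute("UPDATE bigtable SET data_corrected = %s WHERE id = %s",
                (correct_text(data), row_id))
conn.commit()

After that, ALTER TABLE bigtable ADD FULLTEXT(data_corrected) gives you the index to match against while still returning the original data column.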

bvmou