views: 169

answers: 4
Say you've got a big table that contains a varchar column.

How would you match rows that contain the word 'preferred' in the varchar column, given that the data is somewhat noisy and contains occasional spelling errors, e.g.:

['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Peferred Shares',
'Series A Perferred Shares',
'Series A Prefered Stock',
'Series A Preffered Stock',
'Perfered',
'Preffered  C']

The variations of the word 'preferred' in the spelling errors above exhibit a family resemblance, but there's very little that they all have in common. Note that splitting out every word and running Levenshtein on every word in every row is going to be prohibitively expensive.

UPDATE:

There are a couple of other examples like this, e.g. with 'restricted':

['Resticted Stock Plan',
'resticted securities',
'Ristricted Common Stock',
'Common stock (restrticted, subject to vesting)',
'Common Stock (Retricted)',
'Restircted Stock Award',
'Restriced Common Stock',]
+1  A: 

Could you try training it on a small sample of your table to find possible misspellings (using split + Levenshtein) and then use the resulting word list on the full table?
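
For example, a rough sketch of that idea in Python (the sample rows, distance cutoff, and helper names here are illustrative assumptions, not from the question):

import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def misspellings(sample_rows, target, max_dist=2):
    # Collect every word in the sample within max_dist edits of target.
    found = set()
    for row in sample_rows:
        for word in re.findall(r'[a-z]+', row.lower()):
            if levenshtein(word, target) <= max_dist:
                found.add(word)
    return found

sample = ['Class A Preffered', 'Series A Peferred Shares']
print(misspellings(sample, 'preferred'))
# e.g. {'preffered', 'peferred'}

The resulting word list can then be turned into LIKE patterns or exact word matches against the full table, so the expensive distance computation only runs on the sample.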

Simon Nickerson
+1  A: 

Are you trying to do this in T-SQL, or in some other language?

You might be able to hit the majority of them with a regular expression.

Some variation of the following:

"p(er|re|e)f{1,2}er{1,2}ed"

"r(e|i)s?t(ri|ir|rti|i)ct?ed"

You want to make sure it is not case-sensitive.
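
As a quick sanity check, here is a sketch in Python of both patterns applied with case-insensitive matching (in MySQL you would put them in a REGEXP clause instead):

import re

preferred  = re.compile(r"p(er|re|e)f{1,2}er{1,2}ed", re.IGNORECASE)
restricted = re.compile(r"r(e|i)s?t(ri|ir|rti|i)ct?ed", re.IGNORECASE)

samples = ['Class A Preffered', 'Series A Perferred Shares',
           'Ristricted Common Stock', 'Restriced Common Stock']
for s in samples:
    print(s, bool(preferred.search(s) or restricted.search(s)))
# all four print True

One caveat: as written, the first pattern misses '5.95% Preferres Stock', where the error is in the suffix rather than the stem; loosening the ending (e.g. e(d|s)) would catch it.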

J.13.L
It's MySQL, whose regex support leaves much to be desired, but it should be able to handle something like this.
ʞɔıu
Any thoughts along these lines for 'restricted'? That may be all I need.
ʞɔıu
"r(e|i)s?t(ri|ir|rti|i)ct?ed"you want to make sure that is not caps sensitive...
J.13.L
That works for all the examples you gave... You may need to adjust it a bit; let me know if you need help.
J.13.L
+2  A: 

Create two more tables, one of spellings and one of possible spellings:

-- types are illustrative; adjust to taste
create table spelling (
  id int auto_increment primary key,
  word varchar(100)
);
create table possible_spelling (
  id int auto_increment primary key,
  spelling_id int references spelling(id),
  spelling varchar(100)
);
-- possible_spelling also includes the correct spelling;
-- all values are lowercase

insert into spelling( word ) values ('preferred');
insert into possible_spelling( spelling_id, spelling )
  select 1, '%preferred%' union select 1, '%prefered%' union ...;

select * 
from bigtable a 
join possible_spelling b
on (lower(a.data) like b.spelling )
join spelling c on (b.spelling_id = c.id) 
where c.word = 'preferred';

Objection: this will be slow and requires setup. Answer: not that slow, and this should be a one-time thing to categorize and fix your data. One-time setup, then one lookup per incoming row to categorize.

tpdi
+1  A: 

I might do something like this, if you can get away with running the correction once: here is an amazing spellchecker implementation by Peter Norvig:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word frequencies; unseen words get a default count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Strings two edits away that actually appear in the corpus.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

He makes a training set available at http://norvig.com/big.txt. Here is sample output:

>>> correct('prefferred')
'preferred'
>>> correct('ristricted')
'restricted'
>>> correct('ristrickted')
'restricted'

In your case, you could copy the original column to a new one, but pass it through the spellchecker when you do. Then put a full-text index on the correctly spelled column, match your queries against it, and return results from the original column. You only have to do this once, as opposed to calculating distances on every query. You might spellcheck the query input too, or check the corrected version only as a fallback. Either way, the Norvig example is worth studying.
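
A minimal sketch of that one-time pass, assuming a DB-API connection and hypothetical table and column names (bigtable, data, data_corrected); correct() is the function defined above:

import re
import MySQLdb  # or any DB-API driver

def correct_text(text):
    # Spell-correct each alphabetic token (lowercased), leaving
    # punctuation and digits untouched.
    return re.sub(r'[A-Za-z]+', lambda m: correct(m.group().lower()), text)

conn = MySQLdb.connect(db='mydb', user='me', passwd='...')  # placeholders
cur = conn.cursor()
cur.execute("SELECT id, data FROM bigtable")
for row_id, data in cur.fetchall():
    cur.execute("UPDATE bigtable SET data_corrected = %s WHERE id = %s",
                (correct_text(data), row_id))
conn.commit()

After that, ALTER TABLE bigtable ADD FULLTEXT(data_corrected) gives you the index to match against while still returning the original data column.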

bvmou