ansaurus

Question

Answer 1

+1 A:

Is that what you mean?

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

look at http://docs.python.org/library/difflib.html#difflib.get%5Fclose%5Fmatches

Tzury Bar Yochay 2009-09-24 11:46:47

Thank you. This will probably give me some good ideas, but not what I am looking for `get_close_matches('appel', ['ape', 'peach', 'puppy'])` gets me ape. I can set the cutoff, but from some quick experiments, false-positives are far too common.

uswaretech 2009-09-24 12:31:06

Answer 2

+4 A:

There's a great resource for string similarity metrics at University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

Hamming distance
Levenshtein distance
Needleman-Wunch distance or Sellers Algorithm
and many more...

ariddell 2009-09-24 12:13:28

Answer 3

+4 A:

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can make this as a function

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

Nadia Alramli 2009-09-24 13:10:55

Answer 4

+2 A:

I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account) rather than the difflib stuff for two reasons (1) "fast enough" (dynamic programming algo) and "whoooosh" (bit-bashing) C code is available and (2) well-understood behaviour e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.

Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)) and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X and eye-ball them and insert the correct result and calculate some cost-of-mistakes measure for various levels of X.

N.B. Your ape/apple example has distance 2, so X is 0.6 ... I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty

John Machin 2009-09-24 14:41:31

ansaurus

tags:

views:

answers:

String similarity metrics in Python

related questions