I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective, so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things.

Basically, I'm hoping to find something that is simple enough to yield a single percentage while still configurable enough that I can specify what type of comparison(s) to do.

+36  A: 

difflib can do it.

Example from the docs:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Check it out. It has other functions that can help you build something custom.
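
If you need a single percentage rather than a list of close matches, difflib's SequenceMatcher.ratio() can be scaled directly. A minimal sketch (similarity_percent is just an illustrative name, not part of difflib):

import difflib

def similarity_percent(a, b):
    # ratio() returns a float in [0, 1]; scale and round to a whole percentage
    return int(round(difflib.SequenceMatcher(None, a, b).ratio() * 100))

>>> similarity_percent('appel', 'apple')
80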

nosklo
+1 Neat, I don't recall ever seeing this before
Van Gale
+1: Quote the documents.
S.Lott
+1: Great to be introduced to a module I've not used before.
Jarret Hardie
I've actually used difflib before, but found that I couldn't just ask for a percentage match amount. It's been a while, though.
Soviut
@Soviut: e.g. `difflib.SequenceMatcher(None, 'foo', 'bar').ratio()` returns a value between 0 and 1 which can be interpreted as a match percentage. Right?
utku_karatas
+3  A: 

While not specific to Python, here is a question about similar string algorithms:

http://stackoverflow.com/questions/451884/similar-string-algorithm/451910#451910

Dana
+10  A: 

I like nosklo's answer; another method is the Damerau-Levenshtein distance:

"In information theory and computer science, Damerau–Levenshtein distance is a 'distance' (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters."

An implementation of the plain Levenshtein distance in Python, from Wikibooks:

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),  # substitution (or match)
               lev(a[1:], b) + 1,                   # deletion
               lev(a, b[1:]) + 1)                   # insertion
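
The transposition step that Damerau adds on top of this could be sketched as follows (dam_lev is a made-up name; like the recursive version above, it is exponential and only suitable for short strings):

def dam_lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    candidates = [dam_lev(a[1:], b[1:]) + (a[0] != b[0]),  # substitution (or match)
                  dam_lev(a[1:], b) + 1,                   # deletion
                  dam_lev(a, b[1:]) + 1]                   # insertion
    # transposition of two adjacent characters
    if len(a) > 1 and len(b) > 1 and a[0] == b[1] and a[1] == b[0]:
        candidates.append(dam_lev(a[2:], b[2:]) + 1)
    return min(candidates)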

More from Wikibooks, this gives you the length of the longest common substring (LCS):

def LCSubstr_len(S, T):
    # dynamic programming: L[i+1][j+1] is the length of the common suffix
    # of S[:i+1] and T[:j+1]; lcs tracks the longest length seen so far
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    lcs = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                L[i+1][j+1] = L[i][j] + 1
                lcs = max(lcs, L[i+1][j+1])
    return lcs
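
To get the kind of percentage the question asks for, the edit distance can be normalised by the longer string's length. A small sketch built on the lev() function above (similarity_pct is a made-up name, and it inherits lev()'s slowness on long strings):

def similarity_pct(a, b):
    # 100 means identical, lower means more edits relative to string length
    if not a and not b:
        return 100
    return int(round(100.0 * (1 - float(lev(a, b)) / max(len(a), len(b)))))

>>> similarity_pct('apple', 'appel')
60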
Adam Bernier
Thanks, I found some information about Levenshtein while doing my initial searching, but the examples were far too vague. Your answer is excellent.
Soviut
I chose this one because it gives me a nice scalar number I can work with and use for thresholds.
Soviut
+5  A: 

There is also Google's own google-diff-match-patch ("Currently available in Java, JavaScript, C++ and Python").

(Can't comment on it, since I have only used Python's difflib myself.)

Steven
+16  A: 

Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.

>>> import Levenshtein

>>> help(Levenshtein.ratio)

ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)

distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
Pete Skomoroch
+1  A: 

Here's a Python script for computing the longest common substring of two words; it may need tweaking to work for multi-word phrases:

def lcs(word1, word2):
    # all substrings of word1 and word2
    w1 = set(word1[i:j] for i in range(0, len(word1))
             for j in range(1, len(word1) + 1))
    w2 = set(word2[i:j] for i in range(0, len(word2))
             for j in range(1, len(word2) + 1))

    common_subs = w1.intersection(w2)

    # sort the common substrings by length and return the longest
    sorted_cmn_subs = sorted((len(s), s) for s in common_subs)

    return sorted_cmn_subs.pop()[1]
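
A quick usage example with the function above:

>>> lcs('fuzzy', 'wuzzy')
'uzzy'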
+4  A: 

As nosklo said, use the difflib module from the Python standard library.

The difflib module can return a measure of the sequences' similarity using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.

>>> import difflib

>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004

>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
Edi H
Not terribly impressed by SequenceMatcher. It gives the same score to David/Daved that it gives to David/david.
Leeks and Leaks
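
If case differences shouldn't count against the score, one workaround (just a sketch, not part of the answer above) is to normalise the strings before comparing:

>>> import difflib
>>> difflib.SequenceMatcher(None, 'David'.lower(), 'david'.lower()).ratio()
1.0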