ansaurus

Question

Finding groups of similar strings in a large set of strings.

Answer 1

+2 A:

The problem you're trying to solve is a typical clustering problem.

Start with a simple K-Means algorithm and use Levenshtein distance as the function for calculating the distance between elements and cluster centers.

BTW, an algorithm for calculating Levenshtein distance is implemented in Apache Commons Lang: StringUtils.getLevenshteinDistance
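
For readers not on the JVM, here is a minimal Python sketch of the same distance, using the usual two-row dynamic-programming formulation. The function name `levenshtein` is chosen for illustration and is not part of any library mentioned in this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop ca
                current[j - 1] + 1,      # add cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Jonny Smith", "Bonny Smith"))  # 1
    print(levenshtein("John Smith", "Jonny Smith"))   # 2
```

The cost is proportional to the product of the two string lengths, so comparing every pair in a large set gets expensive quickly.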

The main problem with K-Means is that you have to specify the number of clusters (subgroups, in your terms) up front. So you have two options: improve K-Means with some heuristic for choosing that number, or use another clustering algorithm that doesn't require the number of clusters to be specified (but such an algorithm may perform worse and can be very difficult to implement if you decide to write it yourself).
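
As one illustration of the second option, here is a rough sketch of a greedy "leader" clustering pass, which needs a distance threshold instead of a cluster count. This is not K-Means and not something the answer prescribes; `leader_cluster`, `ratio_distance`, and the threshold value are hypothetical names and numbers. `ratio_distance` uses `difflib.SequenceMatcher` from the standard library purely as a stand-in distance, and a Levenshtein function could be passed instead.

```python
from difflib import SequenceMatcher


def ratio_distance(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]: 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def leader_cluster(strings, distance, threshold):
    """Greedy single-pass clustering.

    Each string joins the first cluster whose leader (its first member)
    is within `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for s in strings:
        for cluster in clusters:
            if distance(s, cluster[0]) <= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no leader was close enough
    return clusters


if __name__ == "__main__":
    names = ["Philip Roberts", "Phil Roberts", "Dave Jones", "Davey Jones"]
    # The threshold 0.3 is an arbitrary value picked for this demo.
    for cluster in leader_cluster(names, ratio_distance, threshold=0.3):
        print(cluster)
```

The result depends on input order and on the threshold, so this trades the "how many clusters?" question for a "how close is close enough?" question.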

Roman 2010-07-25 13:20:20

I knew that "grouping" wasn't quite the right word for what I was trying to do. Those links look pretty useful too. Thanks!

latentflip 2010-07-25 13:34:00

Answer 2

+1 A:

For the example you give, I reckon Levenshtein distance would be unsuitable as "Bonny Smith" would be 'very similar' to "Jonny Smith" and would almost certainly end up being considered in the same class.

I think you need to approach this (if working with names) from the point-of-view of certain names having synonyms (e.g. "John", "Jon", "Jonny", "Johnny" etc.) and matching based on these.
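
A minimal sketch of that idea, assuming a hand-built synonym table: every variant maps to one canonical first name, and full names are grouped by their canonical form. The table contents come from the examples in this answer; `CANONICAL_FIRST_NAMES`, `canonical_name`, and `group_by_canonical_name` are hypothetical names, and a real table would have to be much larger.

```python
# Hypothetical synonym table: each known variant maps to a canonical form.
CANONICAL_FIRST_NAMES = {
    "john": "john",
    "jon": "john",
    "jonny": "john",
    "johnny": "john",
}


def canonical_name(full_name: str) -> str:
    """Lowercase the name and replace a known first-name variant."""
    parts = full_name.lower().split()
    if not parts:
        return ""
    # Unknown first names are left unchanged.
    parts[0] = CANONICAL_FIRST_NAMES.get(parts[0], parts[0])
    return " ".join(parts)


def group_by_canonical_name(names):
    """Map each canonical name to the original names that reduce to it."""
    groups = {}
    for name in names:
        groups.setdefault(canonical_name(name), []).append(name)
    return groups


if __name__ == "__main__":
    print(group_by_canonical_name(["John Smith", "Jonny Smith", "Bonny Smith"]))
```

With this table "Jonny Smith" lands in the same group as "John Smith", while "Bonny Smith" stays separate even though it is only one edit away.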

Will A 2010-07-25 13:21:01

William-->Bill... If it's a real-world problem, then its solution is not trivial at all: it requires existing research in linguistics and probably pre-populated databases.

Roman 2010-07-25 13:28:19

Answer 3

+5 A:

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/
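
For a quick feel of the measure, here is a small sketch: the Jaccard index of two sets is the size of their intersection divided by the size of their union. How a string is turned into a set is a design choice; the two helpers below (lowercased words, and character n-grams) are assumptions made for illustration and are not taken from the linked article.

```python
def jaccard_index(a: set, b: set) -> float:
    """|a & b| / |a | b|; two empty sets are treated as identical here."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def word_set(s: str) -> set:
    """Lowercased whitespace-separated words; order and repeats are lost."""
    return set(s.lower().split())


def char_ngrams(s: str, n: int = 2) -> set:
    """Lowercased character n-grams, which keep some local ordering."""
    s = s.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


if __name__ == "__main__":
    a, b = "Mr Philip Roberts", "Philip Roberts"
    print(jaccard_index(word_set(a), word_set(b)))        # 2 shared of 3 words
    print(jaccard_index(char_ngrams(a), char_ngrams(b)))
```

Word sets ignore small spelling differences inside a word ("Phil" and "Philip" share nothing), whereas character n-grams give such pairs partial credit.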

Luther Blissett 2010-07-25 13:24:55

That looks like a very useful link, thanks!

latentflip 2010-07-25 13:32:37

Only if you treat strings as sets of words (i.e., neither the order of the words nor how often each one occurs matters). If order matters, Levenshtein distance (as mentioned by Roman) is much more accurate (though much slower to compute).

Dimitris Andreou 2010-07-25 16:07:02

Answer 4

A:

http://docs.python.org/library/re.html Maybe this could help? It's Python's regular expressions module. With it you can match strings, for example strings that include "*ohn". I'm quite busy right now, but the module is easy to work with, and I'm sure you can handle it with the help of the docs on the Python website.
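
A minimal sketch of what that could look like. Note that "*ohn" is glob-style notation rather than a valid regular expression; the pattern below is one possible reading of it (any word containing "ohn"), and `names_matching` is a hypothetical helper name.

```python
import re

# One interpretation of "*ohn": a whole word that contains "ohn" somewhere.
OHN_WORD = re.compile(r"\b\w*ohn\w*\b", re.IGNORECASE)


def names_matching(names, pattern=OHN_WORD):
    """Return the names in which the pattern occurs at least once."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    names = ["John Smith", "Jonny Smith", "Johnny Smith", "Foo McBar"]
    print(names_matching(names))  # ['John Smith', 'Johnny Smith']
```

"Jonny Smith" is missed because it does not contain "ohn", which shows the limit of this approach: a regular expression finds strings that fit a pattern you already know, not strings that are merely similar to each other.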

Ahmet Yıldırım 2010-07-25 13:34:22

Answer 5

+1 A:

If we're talking about actual pronounceable words, comparing (the start of) their Metaphone keys might be of assistance:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith
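
To show how such keys could be used for grouping, here is a small sketch that buckets names by their key, or by a prefix of it. The keys are copied from the listing above rather than computed; a real program would get them from a Metaphone implementation, which is not included here. `METAPHONE_KEYS` and `group_by_key_prefix` are hypothetical names.

```python
from collections import defaultdict

# Keys copied verbatim from the listing above (not computed by this code).
METAPHONE_KEYS = {
    "Mr Philip Roberts": "MRFLPRBRTS",
    "Phil Roberts": "FLRBRTS",
    "Philip Roberts": "FLPRBRTS",
    "Foo McBar": "FMKBR",
    "David Jones": "TFTJNS",
    "Dave Jones": "TFJNS",
    "Davey Jones": "TFJNS",
    "Jane Doe": "JNT",
    "John Smith": "JNSM0",
    "Jonny Smith": "JNSM0",
}


def group_by_key_prefix(keys_by_name, prefix_length=None):
    """Group names whose keys share the first `prefix_length` characters.

    With prefix_length=None the whole key must match.
    """
    groups = defaultdict(list)
    for name, key in keys_by_name.items():
        groups[key[:prefix_length]].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_key_prefix(METAPHONE_KEYS))     # exact key matches
    print(group_by_key_prefix(METAPHONE_KEYS, 2))  # first two characters only
```

Matching on the full key pairs "Dave Jones" with "Davey Jones" and "John Smith" with "Jonny Smith"; a two-character prefix also pulls in "David Jones", but at the price of lumping "Jane Doe" together with the Smiths.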

Wrikken 2010-07-25 13:37:06
