views: 80

answers: 4
I'm downloading a long list of my email subject lines, with the intent of finding email lists that I was a member of years ago, so that I can purge them from my Gmail account (which is getting pretty slow).

I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject.

I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines....

Now, many subject lines would fail a string match, but "Google Friends : Our latest news" and "Google Friends : What we're doing today" are more similar to each other than to a random subject line, as are "Virgin Airlines has a great sale today" and "Take a flight with Virgin Airlines".

So -- how can I start to automagically extract trends/examples of strings that are more similar to each other?

Approaches I've considered and discarded ('because there must be some better way'):

  • Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones
  • Stripping off the first word or two and then counting the occurrence of each substring
  • Comparing Levenshtein distance between entries
  • Some sort of string similarity index ...

Most of these were rejected for massive inefficiency or the likelihood that a vast amount of manual intervention would be required. I guess I need some sort of fuzzy string matching...?

In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic, so that I add to my set of tools rather than special-casing for this data set.

After this, I'd be matching the occurrence of particular subject strings with 'From' addresses. I'm not sure if there's a good way of building a data structure that represents how likely/unlikely it is that two messages are part of the same email list, or of filtering all my email subjects/From addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one.

Any guidance would be appreciated.

A: 

The word you're looking for is "Bayesian".

Ignacio Vazquez-Abrams
+2  A: 

I would first turn each string of characters into a set or multiset of words (ignoring punctuation and differences in lower/upper case). (If that's not powerful enough, in a second pass I could try pairs or even triples of adjacent words, known as bigrams and trigrams.) The key measure of similarity between strings thus reduced is: which words that are not high-frequency overall (not the, and, etc. ;-) are common to both strings? So a simple set intersection (or multiset intersection, but for your simple use case I think just sets would work fine, especially sets of bigrams) should suffice to measure the "commonality". A word that's common to the two strings should be worth more the rarer it is, so the negative log of the word's frequency across the whole corpus is an excellent starting point for this heuristic.
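
A rough sketch of this word-set overlap in Python (plain word sets rather than bigrams; the tokenization and every function/variable name below are illustrative assumptions, not a prescribed implementation):

    import math
    import re
    from collections import Counter
    from itertools import combinations

    def words(subject):
        """Lowercase, strip punctuation, return the set of words in a subject line."""
        return set(re.findall(r"[a-z0-9']+", subject.lower()))

    def similarity_scores(subjects):
        """Score each pair of subjects by the summed -log frequency of shared words."""
        word_sets = [words(s) for s in subjects]
        # Document frequency of each word across the whole corpus of subjects.
        doc_freq = Counter(w for ws in word_sets for w in ws)
        total = len(subjects)
        scores = {}
        for (i, a), (j, b) in combinations(enumerate(word_sets), 2):
            shared = a & b
            # Rare shared words count for a lot; ubiquitous ones ("the", "and") for almost nothing.
            scores[(i, j)] = sum(-math.log(doc_freq[w] / total) for w in shared)
        return scores

    if __name__ == "__main__":
        subjects = [
            "Google Friends : Our latest news",
            "Google Friends : What we're doing today",
            "Virgin Airlines has a great sale today",
            "Take a flight with Virgin Airlines",
        ]
        for pair, score in sorted(similarity_scores(subjects).items(), key=lambda kv: -kv[1]):
            print(pair, round(score, 2))

With the sample subjects above, the two "Google Friends" lines and the two "Virgin Airlines" lines should come out as the highest-scoring pairs; swapping the word sets for bigram sets is a one-line change to words().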

Alex Martelli
This is an interesting approach - a concern: one of my goals for this project was to learn of already-existing algorithms rather than writing one of my own for this problem, so that I had a better understanding of the problem space. This feels like a specific 'would be good for this case' type of approach rather than a 'here's a commonly used tool', and I'm a bit afraid of the computational intensity. (Then again, it's still the best answer I've gotten so far, so thanks!)
Rizwan Kassim
@RizwanK, working on streams of words (often somewhat normalized, e.g. wrt case) rather than of characters is a very common approach (rather than a "tool";-) in information retrieval, and sets or multisets (of words, bigrams, trigrams) are anything but rare. If you're looking for existing Python code to help with this you may find something in NLTK, though I'm not sure. But I definitely wasn't inventing data-management approaches on the fly for your specific problem only;-). If anything, the shortness of email subjects compared with the usual "documents" handled in IR should _reduce_ the computational load!
Alex Martelli
"But I definitely wasn't inventing data-management approaches on the fly for your specific problem only"*grin* Fair point. Do you know if there's a name for the approach that you suggested? Thanks!
Rizwan Kassim
+1  A: 

Smooth BLEU

You might be able to make use of the smooth-BLEU score between the subjects. BLEU is an evaluation metric that is used to score how similar the translations produced by a machine translation system are to translations produced by humans. Smooth BLEU is calculated like the normal BLEU score, except that you add one to the n-gram match counts so that you avoid multiplying anything by zero when evaluating short segments of text.

Smooth-BLEU should be much faster to compute than the Levenshtein distance, while still capturing word order information, since it looks at n-gram matches rather than just matches between single words.

Unfortunately, I don't have a pointer to a Python BLEU implementation, but you'll find a Perl implementation from NIST here.
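
For illustration only, here is a simplified sentence-level smoothed-BLEU sketch in Python; it assumes add-one smoothing on both the match and total n-gram counts, and it is not the NIST implementation referred to above:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """All n-grams of a token list, as a Counter."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def smooth_bleu(candidate, reference, max_n=4):
        """Sentence-level BLEU with add-one smoothing on the n-gram counts.

        candidate and reference are lists of (e.g. lowercased) words.
        """
        log_precisions = []
        for n in range(1, max_n + 1):
            cand = ngrams(candidate, n)
            ref = ngrams(reference, n)
            # Clipped matches: a candidate n-gram only counts as often as it appears in the reference.
            matches = sum(min(count, ref[gram]) for gram, count in cand.items())
            total = sum(cand.values())
            # Add-one smoothing keeps short strings from zeroing out the whole product.
            log_precisions.append(math.log((matches + 1) / (total + 1)))
        # Brevity penalty discourages trivially short candidates.
        bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
        return bp * math.exp(sum(log_precisions) / max_n)

    if __name__ == "__main__":
        a = "google friends our latest news".split()
        b = "google friends what we're doing today".split()
        print(smooth_bleu(a, b))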

dmcer
A: 

I found this SO question, which lists some better-known algorithms as well.

http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings

Rizwan Kassim