views:

224

answers:

3

Lets say I have 100000 email bodies and 2000 of them contains an abitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also I need to filter out phrases that I already know occur in all mails.

Example:

string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!";
string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.";
string mailbody3 = "A completely different body.";
string[] mailbodies = new[] {mailbody1, mailbody2, mailbody3};
string[] ignoredPhrases = new[] {"Welcome to the world of"};

string[] results = DiscoverPhrases(mailbodies, ignoredPhrases);

In this example I want the DiscoverPhrases function to return "lorem ipsum dolor sit amet" and "have a nice day". It's not that important if the function also returns shorter "noise" phrases, but if its possible it would be nice to eliminate these in the process.

Edit: I forgot to include mailbody3 in the example.

A: 

I'm not sure if this what you want but check out longest common substring problem and diff utility algorithms.

Nick D
A: 

Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function for computing the Longest Common Subsequence):

someMinimumLengthParameter = 20;
foundPhrases = [];

do {
    lcs = LCS(mailbodies);
    if (lcs in ignoredPhrases) continue;

    foundPhrases += lcs;

    for body in mailbodies {
        body.remove(lcs);
    }    
} while(lcs.length > someMinimumLengthParameter);
Dominic Rodger
+3  A: 

Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.

Norman Ramsey
Thanks. Thats a great tip!
JohannesH