Lets say I have 100000 email bodies and 2000 of them contains an abitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also I need to filter out phrases that I already know occur in all mails.
Example:
string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!";
string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.";
string mailbody3 = "A completely different body.";
string[] mailbodies = new[] {mailbody1, mailbody2, mailbody3};
string[] ignoredPhrases = new[] {"Welcome to the world of"};
string[] results = DiscoverPhrases(mailbodies, ignoredPhrases);
In this example I want the DiscoverPhrases function to return "lorem ipsum dolor sit amet" and "have a nice day". It's not that important if the function also returns shorter "noise" phrases, but if its possible it would be nice to eliminate these in the process.
Edit: I forgot to include mailbody3 in the example.