I have over 1000 surveys, many of which contain open-ended replies.

I would like to be able to 'parse' all the words and get a ranking of the most used words (disregarding common words) to spot a trend.

How can I do this? Is there a program I can use?

EDIT: If a third-party solution is not available, it would be great if we could keep the discussion to Microsoft technologies only. Cheers.

+4  A: 

The NLTK contains tons of useful tools for dealing with natural language.

Check out this article (linked from the NLTK site) for an example of building a persistent, network-accessible frequency distribution. Even if it's not exactly what you're looking for, it might help you get a feel for how to approach your problem.

UPDATE:

Re: MS technologies, you can run NLTK on .NET using IronPython. See this related SO question.

Hank Gay
A: 

SharpNLP is a native .NET library for doing NLP. I don't know how it compares to NLTK, since I'd never heard of it until I went Googling.

Hank Gay
+6  A: 

Divide and conquer. Split up your problem into many smaller problems and solve each of them.

First problem: turn a paragraph into a list of words.

You are fortunate because you don't have to worry about being perfect. Actually parsing natural languages to determine exactly what "a word" is can be very difficult, but frankly you probably don't really care whether "lightbulb" has the same semantics as "light bulb". Since you are looking in particular for common words (for now; more on that later), the interesting ones are precisely those that are easy to identify, because they come up a lot.

So, break this problem down further. You want a list of words. Start by getting a string with the text in it:

string source;
using (StreamReader streamReader = new StreamReader(@"c:\survey.txt"))
{
    source = streamReader.ReadToEnd();
}

Great, you've got a string somehow. Now turn that into an array of words. Because you probably want to count "Frog" and "frog" as the same word, make everything lowercase. How to do all that? Split the lowercase string up based on spaces, newlines, tabs and punctuation:

char[] punctuation = new char[] { ' ', '\n', '\r', '\t', '(', ')', '"' };
string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);

Now examine the output. That was terrible. There's all kinds of stuff we forgot: periods and commas and colons and semicolons and so on. Figure out which punctuation you care about and add it to the list (a fuller sketch follows below).

Is ToLower the right thing to do? What about ToLowerInvariant? There are times you want to stress about that; this isn't one of them. The fact that ToLower doesn't necessarily canonicalize the Turkish lowercase I in a manner that consistently round-trips is unlikely to throw off your summary statistics. We're not going for pinpoint accuracy here. If one respondent writes "luxury-yacht" and another writes "luxury yacht", the former will come through as a single word if you forget to break on hyphens. Who cares? Hyphenated words are unlikely to be in your top ten anyway.
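For example, after eyeballing the output a couple of times, your separator list might grow to something like this. (A sketch only; which characters matter depends on your data.)

char[] punctuation = new char[] { ' ', '\n', '\r', '\t', '(', ')', '"',
                                  '.', ',', ':', ';', '!', '?' };
// Hyphen deliberately omitted, per the paragraph above: "luxury-yacht" stays one token.
string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);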

Next problem: count all the occurrences of each word:

var firstPass = new Dictionary<string, int>();
foreach(string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
} 
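(As an aside, the same first pass can be written with LINQ's GroupBy, assuming a using System.Linq directive; the explicit loop above is arguably clearer, but both produce the same dictionary:)

// Equivalent first pass: group identical tokens, count each group.
var firstPass = tokens
    .GroupBy(token => token)
    .ToDictionary(g => g.Key, g => g.Count());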

Great. We now have a dictionary that maps words to integers. Trouble is, that's backwards. What you want to know is what are all the words that have the same number of occurrences. A dictionary is a sequence of key/value pairs, so group it:

var groups = from pair in firstPass
             group pair.Key by pair.Value;

OK, now we have a sequence of groups of words, each one associated with its count of occurrences. Order it, largest count first, since we're after the most common words. Remember, the key of the group is the value of the dictionary, the count:

var sorted = from g in groups
             orderby g.Key descending
             select g;

And you want the top hundred, let's say:

foreach(var g in sorted.Take(100))
{
  Console.WriteLine("Words with count {0}:", g.Key);
  foreach(var w in g)
    Console.WriteLine(w);
}

And you're done.
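Put together, the whole thing is a short console program. Here is a sketch that also folds in a stop-word list, since the question asked to disregard common words; the list itself is just an illustrative stub, and the file path is the one assumed above.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class SurveyWordCounts
{
    static void Main()
    {
        // Common words to disregard; a stub, extend it for real use.
        var stopWords = new HashSet<string> { "the", "a", "an", "and", "or", "of", "to", "in", "is", "it" };

        string source;
        using (var reader = new StreamReader(@"c:\survey.txt"))
        {
            source = reader.ReadToEnd();
        }

        char[] punctuation = { ' ', '\n', '\r', '\t', '(', ')', '"',
                               '.', ',', ':', ';', '!', '?' };
        string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);

        // First pass: word -> occurrence count, skipping stop words.
        var firstPass = new Dictionary<string, int>();
        foreach (string token in tokens)
        {
            if (stopWords.Contains(token))
                continue;
            if (!firstPass.ContainsKey(token))
                firstPass[token] = 1;
            else
                ++firstPass[token];
        }

        // Flip it: group words by count, largest counts first.
        var sorted = from pair in firstPass
                     group pair.Key by pair.Value into g
                     orderby g.Key descending
                     select g;

        foreach (var g in sorted.Take(100))
        {
            Console.WriteLine("Words with count {0}:", g.Key);
            foreach (var w in g)
                Console.WriteLine("  " + w);
        }
    }
}

Filtering stop words inside the first pass keeps them out of the counts entirely, which is simpler than stripping them from the results afterwards.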

Now, is this really what you're interested in? I think it might be more interesting to look for unusual words, or pairs of words. If the words "yacht" and "racing" show up together a lot, not a surprise. If "tomato" and "ketchup" show up a lot together, not surprising. If "tomato" and "racing" start showing up together, then maybe something noteworthy is going on.

That requires much deeper analysis; read up on Bayes' Theorem if that's the sort of thing you're interested in.
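As a first crude step toward word pairs, you could count adjacent-word bigrams over the same tokens array; no Bayes involved yet, just raw co-occurrence counts:

// Count adjacent word pairs (bigrams). A crude proxy for
// "words that show up together"; real co-occurrence analysis
// would use a window, or group per survey response.
var pairCounts = new Dictionary<string, int>();
for (int i = 0; i < tokens.Length - 1; i++)
{
    string pair = tokens[i] + " " + tokens[i + 1];
    int count;
    pairCounts.TryGetValue(pair, out count);
    pairCounts[pair] = count + 1;
}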

Also note that this tracks the raw count of words, not their frequency -- the number of times they appear per thousand words. That might also be an interesting metric to measure: not just how many times did this word appear, period, but how many times did it appear as a percentage of the text.
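That per-thousand figure is a one-line division once you have the counts; for instance, using the firstPass dictionary and tokens array from above:

// Occurrences per thousand words rather than a raw count.
foreach (var pair in firstPass.OrderByDescending(p => p.Value).Take(10))
{
    double perThousand = pair.Value * 1000.0 / tokens.Length;
    Console.WriteLine("{0}: {1:F2} per thousand words", pair.Key, perThousand);
}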

Eric Lippert
+1 pragmatic approach, and not only giving OP the fish, but teaching OP how to fish.
Zano
A: 

You can create a Lucene index of the text with a custom stop-word list, which will skip common words. Open the Lucene index with Luke and it will show you the top terms in the index.

You can enable stemming while indexing so that words are grouped to their root form. That will help you club together different forms of the same word (plurals, different tenses, etc.); that is, "questions", "question", and "questioned" will all show up as "question". You can't easily do that with plain word counting.
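A rough sketch of the indexing side, assuming roughly the Lucene.NET 3.x API (class names and constructor overloads differ between versions, and the file path, index folder, and stop-word list here are placeholders):

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class SurveyIndexer
{
    static void Main()
    {
        // Custom stop-word list: these words are skipped at indexing time.
        var stopWords = new HashSet<string> { "the", "a", "an", "and", "of", "to" };
        var analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
        // For stemming, swap in the SnowballAnalyzer from Lucene.Net.Contrib.

        using (var dir = FSDirectory.Open(new DirectoryInfo("survey-index")))
        using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // One document per reply; this placeholder assumes one reply per line.
            foreach (string reply in File.ReadAllLines(@"c:\survey.txt"))
            {
                var doc = new Document();
                doc.Add(new Field("content", reply, Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
        // Now open the "survey-index" folder in Luke to browse the top terms.
    }
}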

Shashikant Kore