Divide and conquer. Split up your problem into many smaller problems and solve each of them.
First problem: turn a paragraph into a list of words.
You are fortunate because you don't have to worry about being perfect. Actually parsing natural languages to determine exactly what "a word" is can be very difficult, but frankly you probably don't really care whether "lightbulb" has the the same semantics as "light bulb". Since you are in particular looking for common words (for now, more on that later) the interesting ones are precisely those that are easy to identify because they come up a lot.
So, break this problem down further. You want a list of words. Start by getting a string with the text in it:
StreamReader streamReader = new StreamReader(@"c:\survey.txt");
string source = streamReader.ReadToEnd();
Great, you've got a string somehow. Now turn that into an array of words. Because you probably want to count "Frog" and "frog" as the same word, make everything lowercase. How to do all that that? Split the lowercase string up based on spaces, newlines, tabs and punctuation:
char[] punctuation = new char[] {' ', '\n', '\r', '\t', '(', ')', '"'};
string[] tokens = source.ToLower().Split(punctuation, true);
Now examine the output. That was terrible. There's all kinds of stuff we forget. Periods and commas and colons and semicolons and so on. Figure out which punctuation you care about and add it to the list.
Is ToLower the right thing to do? What about ToLowerInvariant? There are times you want to stress about it; this isn't one of them. The fact that ToLower doesn't necessarily canoncialize the Turkish lowercase I in a manner that consistently round-trips is unlikely to throw off your summary statistics. We're not going for pinpoint accuracy here. If someone says "luxury-yacht", and someone says "luxury yacht", the former might be one word if you forget to break on hyphens. Who cares? Hyphenated words are unlikely to be in your top ten anyway.
Next problem: count all the occurrences of each word:
var firstPass = new Dictionary<string, int>();
foreach(string token in tokens)
{
if (!firstPass.ContainsKey(token))
firstPass[token] = 1;
else
++firstPass[token];
}
Great. We now have a dictionary that maps words to integers. Trouble is, that's backwards. What you want to know is what are all the words that have the same number of occurrences. A dictionary is a sequence of key/value pairs, so group it:
var groups = from pair in firstPass
group pair.Key by pair.Value;
OK, now we have a sequence of groups of words, each one associated with its count of occurrences. Order it. Remember, the key of the group is the value of the dictionary, the count:
var sorted = from group in groups
orderby group.Key
select group;
And you want the top hundred, let's say:
foreach(var g in sorted.Take(100))
{
Console.WriteLine("Words with count {0}:", g.Key);
foreach(var w in g)
Console.WriteLine(w);
}
And you're done.
Now, is this really what you're interested in? I think it might be more interesting to look for unusual words, or pairs of words. If the words "yacht" and "racing" show up together a lot, not a surprise. If "tomato" and "ketchup" show up a lot together, not surprising. If "tomato" and "racing" start showing up together, then maybe something noteworthy is going on.
That requires much deeper analysis; read up on Bayes' Theorem if that's the sort of thing you're interested in.
Also note that this tracks the raw count of words, not their frequency -- the number of times they appear per thousand words. That might also be an interesting metric to measure: not just how many times did this word appear, period, but how many times did it appear as a percentage of the text.