I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.

Please provide not only an implementation, but also an explanation.

Thanks

+1  A: 

I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# with the streaming protocol. There are also other approaches. See http://stackoverflow.com/questions/339344/is-there-a-net-equivalent-to-apache-hadoop and http://stackoverflow.com/questions/436686/-net-mapreduce-implementation.
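
To make this concrete, here is a minimal sketch of a streaming word-count job in C#. The class names and tokenization are illustrative assumptions, not Hadoop requirements. Each class is a separate executable: Hadoop streaming pipes the input text through the mapper, sorts the mapper output by key, and pipes that through the reducer.

// Mapper (one executable): reads raw text from stdin and emits one
// "word<TAB>1" line per word.
using System;

class WordCountMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
            foreach (var word in line.Split(new[] { ' ', '\t', '.', ',', ';', ':', '!', '?' },
                                            StringSplitOptions.RemoveEmptyEntries))
                Console.WriteLine("{0}\t1", word.ToLowerInvariant());
    }
}

// Reducer (a second executable): streaming delivers the mapper output sorted
// by key, so equal words arrive consecutively; sum each run and emit the total.
using System;

class WordCountReducer
{
    static void Main()
    {
        string currentWord = null;
        long count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != currentWord)
            {
                if (currentWord != null)
                    Console.WriteLine("{0}\t{1}", currentWord, count);
                currentWord = parts[0];
                count = 0;
            }
            count += long.Parse(parts[1]);
        }
        if (currentWord != null)
            Console.WriteLine("{0}\t{1}", currentWord, count);
    }
}

Sorting the reducer's output by count and taking the first 10,000 lines then yields the frequency-sorted list.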

Matthew Flaschen
+2  A: 
IEnumerable<string> inputList; // the source words (assume it's initialized)
var mostFrequentlyUsed = inputList.GroupBy(word => word)
    .Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
    .OrderByDescending(word => word.Frequency);

Explanation: I don't know that it needs much further explanation, but I'll try. inputList is an array or any other collection providing the source words. The GroupBy function groups the input collection by some key (which, in my code, is the word itself, as the lambda word => word indicates). The output (a set of groups keyed by word) is then projected into objects with Word and Frequency properties and sorted by Frequency in descending order. You could append .Take(10000) to take the first 10,000. The whole thing can easily be parallelized with the .AsParallel() operator provided by PLINQ (see the sketch after the query below). The query operator syntax might look clearer:

var mostFrequentlyUsed = 
     (from word in inputList
      group word by word into wordGroup
      select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
     .OrderByDescending(word => word.Frequency).Take(10000);
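
As a sketch of the parallelization mentioned above (assuming the same inputList), PLINQ only needs an AsParallel() call ahead of the grouping:

var top10000 = inputList.AsParallel()   // PLINQ spreads the grouping across cores
    .GroupBy(word => word)
    .Select(g => new { Word = g.Key, Frequency = g.Count() })
    .OrderByDescending(x => x.Frequency)
    .Take(10000)                        // keep only the 10,000 most frequent
    .ToList();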
Mehrdad Afshari
That can be very inefficient, because it's not parallelized.
Matthew Flaschen
Tens of thousands of words is nothing to worry about, and btw, look at PLINQ: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
Mehrdad Afshari
Where would you propose getting the source input words? That seems to be the toughest part of the problem, but you're taking it as a given.
Dennis Palmer
@Dennis: That's tough. But I don't think it's programming-related. The Wiktionary folks take it from Project Gutenberg texts.
Mehrdad Afshari
+1  A: 

The first thing to pop into my head (not syntax-checked, and verbose for Perl, for demonstration purposes):

#!/usr/bin/perl
use strict;
use warnings;

# Read the input words: all word tokens from stdin or the files named on the command line.
my @words = grep { length } map { split /\W+/ } <>;

# Count occurrences of each word in a hash.
my %wordFreq;
foreach my $word (@words) {
    $wordFreq{$word}++;
}

# Sort descending by frequency so the most popular words come first.
my @mostPopularWords = sort { $wordFreq{$b} <=> $wordFreq{$a} } keys %wordFreq;

for my $i (0 .. 9999) {
    last if $i > $#mostPopularWords;
    print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n";
}
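
Since the question prefers C#, the same hash-count approach might look like this (a sketch; GetWords() is a hypothetical stand-in for whatever supplies the input words, like @words above):

using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<string> words = GetWords(); // hypothetical source of input words

// Count occurrences of each word in a dictionary.
var wordFreq = new Dictionary<string, int>();
foreach (var word in words)
{
    wordFreq.TryGetValue(word, out var count); // count is 0 when the word is new
    wordFreq[word] = count + 1;
}

// Sort descending by frequency and print the top 10,000.
var rank = 0;
foreach (var pair in wordFreq.OrderByDescending(p => p.Value).Take(10000))
    Console.WriteLine($"{rank++}: {pair.Key} ({pair.Value} hits)");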
cyberconte
A: 

As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data, intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000.
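
In C#, the same filter-sort-take pipeline could be sketched as follows; the file names and the tab-separated "word<TAB>count" layout of the 1-gram file are assumptions for illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class TopWordsFromNgrams
{
    static void Main()
    {
        // English dictionary: one word per line (file name is illustrative).
        var dictionary = new HashSet<string>(File.ReadLines("english-words.txt"));

        // 1-gram data: "word<TAB>count" per line (layout assumed).
        var top10000 = File.ReadLines("1grams.tsv")
            .Select(line => line.Split('\t'))
            .Where(parts => dictionary.Contains(parts[0]))
            .Select(parts => new { Word = parts[0], Count = long.Parse(parts[1]) })
            .OrderByDescending(x => x.Count)   // like sort -rn -k 2
            .Take(10000);                      // like head -10000

        foreach (var entry in top10000)
            Console.WriteLine("{0}\t{1}", entry.Word, entry.Count);
    }
}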

Darius Bacon