I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.

Please provide not only an implementation, but also an explanation.

Thanks

+1  A: 

I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# with the streaming protocol. There are also other approaches. See http://stackoverflow.com/questions/339344/is-there-a-net-equivalent-to-apache-hadoop and http://stackoverflow.com/questions/436686/-net-mapreduce-implementation.
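
To make this concrete, here is a minimal sketch of a streaming word-count job in C#. The class names and tokenization are illustrative assumptions, not Hadoop requirements. Each class is a separate executable: Hadoop streaming pipes the input text through the mapper, sorts the mapper output by key, and pipes that through the reducer.

// Mapper (one executable): reads raw text from stdin and emits one
// "word<TAB>1" line per word.
using System;

class WordCountMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
            foreach (var word in line.Split(new[] { ' ', '\t', '.', ',', ';', ':', '!', '?' },
                                            StringSplitOptions.RemoveEmptyEntries))
                Console.WriteLine("{0}\t1", word.ToLowerInvariant());
    }
}

// Reducer (a second executable): streaming delivers the mapper output sorted
// by key, so equal words arrive consecutively; sum each run and emit the total.
using System;

class WordCountReducer
{
    static void Main()
    {
        string currentWord = null;
        long count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != currentWord)
            {
                if (currentWord != null)
                    Console.WriteLine("{0}\t{1}", currentWord, count);
                currentWord = parts[0];
                count = 0;
            }
            count += long.Parse(parts[1]);
        }
        if (currentWord != null)
            Console.WriteLine("{0}\t{1}", currentWord, count);
    }
}

Sorting the reducer's output by count and taking the first 10,000 lines then yields the frequency-sorted list.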

Matthew Flaschen
+2  A: 
IEnumerable<string> inputList; // the source words (assume it's initialized)
var mostFrequentlyUsed = inputList.GroupBy(word => word)
    .Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
    .OrderByDescending(word => word.Frequency);

Explanation: I don't know that it needs much further explanation, but I'll try. inputList is an array or any other collection providing the source words. The GroupBy function groups the input collection by some key (which, in my code, is the word itself, as the lambda word => word indicates). The output (a set of groups keyed by word) is then projected into objects with Word and Frequency properties and sorted by Frequency in descending order. You could append .Take(10000) to take the first 10,000. The whole thing can easily be parallelized with the .AsParallel() operator provided by PLINQ (see the sketch after the query below). The query operator syntax might look clearer:

var mostFrequentlyUsed = 
     (from word in inputList
      group word by word into wordGroup
      select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
     .OrderByDescending(word => word.Frequency).Take(10000);
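
As a sketch of the parallelization mentioned above (assuming the same inputList), PLINQ only needs an AsParallel() call ahead of the grouping:

var top10000 = inputList.AsParallel()   // PLINQ spreads the grouping across cores
    .GroupBy(word => word)
    .Select(g => new { Word = g.Key, Frequency = g.Count() })
    .OrderByDescending(x => x.Frequency)
    .Take(10000)                        // keep only the 10,000 most frequent
    .ToList();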
Mehrdad Afshari
That can be very inefficient, because it's not parallelized.
Matthew Flaschen
Tens of thousands of words is nothing to worry about, and btw, look at PLINQ: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
Mehrdad Afshari
Where would you propose getting the source input words? That seems to be the toughest part of the problem, but you're taking it as a given.
Dennis Palmer
@Dennis: That's tough. But I don't think it's programming-related. The Wiktionary folks take it from Project Gutenberg texts.
Mehrdad Afshari
+1  A: 

The first thing to pop into my head (not syntax-checked, and verbose for Perl, for demonstration purposes):

#!/usr/bin/perl
use strict;
use warnings;

# Read the input words: all word tokens from stdin or the files named on the command line.
my @words = grep { length } map { split /\W+/ } <>;

# Count occurrences of each word in a hash.
my %wordFreq;
foreach my $word (@words) {
    $wordFreq{$word}++;
}

# Sort descending by frequency so the most popular words come first.
my @mostPopularWords = sort { $wordFreq{$b} <=> $wordFreq{$a} } keys %wordFreq;

for my $i (0 .. 9999) {
    last if $i > $#mostPopularWords;
    print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n";
}
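
Since the question prefers C#, the same hash-count approach might look like this (a sketch; GetWords() is a hypothetical stand-in for whatever supplies the input words, like @words above):

using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<string> words = GetWords(); // hypothetical source of input words

// Count occurrences of each word in a dictionary.
var wordFreq = new Dictionary<string, int>();
foreach (var word in words)
{
    wordFreq.TryGetValue(word, out var count); // count is 0 when the word is new
    wordFreq[word] = count + 1;
}

// Sort descending by frequency and print the top 10,000.
var rank = 0;
foreach (var pair in wordFreq.OrderByDescending(p => p.Value).Take(10000))
    Console.WriteLine($"{rank++}: {pair.Key} ({pair.Value} hits)");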
cyberconte
A: 

As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data, intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000.
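
In C#, the same filter-sort-take pipeline could be sketched as follows; the file names and the tab-separated "word<TAB>count" layout of the 1-gram file are assumptions for illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class TopWordsFromNgrams
{
    static void Main()
    {
        // English dictionary: one word per line (file name is illustrative).
        var dictionary = new HashSet<string>(File.ReadLines("english-words.txt"));

        // 1-gram data: "word<TAB>count" per line (layout assumed).
        var top10000 = File.ReadLines("1grams.tsv")
            .Select(line => line.Split('\t'))
            .Where(parts => dictionary.Contains(parts[0]))
            .Select(parts => new { Word = parts[0], Count = long.Parse(parts[1]) })
            .OrderByDescending(x => x.Count)   // like sort -rn -k 2
            .Take(10000);                      // like head -10000

        foreach (var entry in top10000)
            Console.WriteLine("{0}\t{1}", entry.Word, entry.Count);
    }
}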

Darius Bacon