ansaurus

Question

Answer 1

+4 A:

Word counting is the canonical example of a MapReduce program (pseudo code from Wikipedia):

void map(String name, String document):
  for each word w in document:
     EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

I am not saying that this is the way to do it, but it is definitely an option if you need something that scales well when the number of distinct words outsizes the memory available on a single machine. As long as you are able to stay below the memory limit, a simple loop updating a hash table should do the trick.

Jørn Schou-Rode 2010-05-17 20:54:40

Answer 2

+3 A:

I don't understand why you feel a database would not be a suitable solution. You will probably only have about 100000 rows and the small size of the table will mean that it can be stored entirely in memory. Make the word the primary key and lookups will be very fast.

Mark Byers 2010-05-17 20:54:49

+1 This solution is easy to implement, takes care automatically of concurrecy and caching, and probably has a good performance. I would replace it with something else only after proving that it is not efficient enough for my needs.

Eyal Schneider 2010-05-17 22:27:53

Answer 3

+1 A:

Use a hash table.

BlueRaja - Danny Pflughoeft 2010-05-17 20:56:10

Answer 4

+1 A:

Your solution sounds fine. The if the cache is based on recent usage count, then it will hold the word counts for most frequent words. (Word distribution is something like first 100 words covers 90% of word instances) so you don't need a very large cache.

If you want to improve performance and drop the db, you can encode the words as a trie, and store usage counts in the leaf nodes. In essense, that's what the database is doing if you index on word text, so you are really only avoiding the db latency. If that is the goal, then there are other ways of avoiding db latency, such as using parallel lookups.

mdma 2010-05-17 20:57:51

Answer 5

+2 A:

If performance is your main goal, you could use a hash based or trie based structure in RAM only. Assuming that you do some useful filtering anyway (to not count terms with non-word characters), the maximum number of words in your table will be in the range of 10⁶ to 10⁷ (even if multiple languages are involved), so this will easily fit into the memory of a current PC (and completely avoid all the database handling).

On the other hand, if you have to implement the hashing table details yourself, there is just more code that you can do wrong (while the database guys have hopefully tweaked their code to the maximum). So even minor details in your own implementation might lead to performance loss again.

So this dilemma clearly shows us the first and second rule of optimization: 1. Don't optimize prematurely. 2. Measure, before you optimize.

:)

Bananeweizen 2010-05-17 21:30:51

ansaurus

tags:

views:

answers:

Tracking/Counting Word Frequency

related questions