views: 119
answers: 5

I'd like to get some community consensus on a good design for storing and querying word frequency counts. I'm building an application in which I have to parse text inputs and store how many times each word has appeared (over time). So, given the following inputs:

  • "To Kill a Mocking Bird"
  • "Mocking a piano player"

The application would store the following values:

Word    Count
-------------
To      1
Kill    1
A       2
Mocking 2
Bird    1
Piano   1
Player  1

Later, I need to be able to quickly query the count of an arbitrary word.

My current plan is to simply store the words and counts in a database and rely on caching word-count values ... But I suspect that I won't get enough cache hits to make this a viable solution in the long term.

Can anyone suggest algorithms, or data structures, or any other idea that might make this a well-performing solution?

+4  A: 

Word counting is the canonical example of a MapReduce program (pseudo code from Wikipedia):

void map(String name, String document):
  // name: document name; document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word; partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

I am not saying that this is the way to do it, but it is definitely an option if you need something that scales well when the number of distinct words outgrows the memory available on a single machine. As long as you are able to stay below the memory limit, a simple loop updating a hash table (sketched below) should do the trick.
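
For in-memory inputs, that loop is only a few lines. A minimal sketch in Python (the tokenizing regex and the lower-casing are assumptions; adapt them to your input):

import re
from collections import Counter

counts = Counter()

def add_document(text):
    # Lower-case so that "Mocking" and "mocking" share one entry.
    for word in re.findall(r"[a-zA-Z']+", text.lower()):
        counts[word] += 1

add_document("To Kill a Mocking Bird")
add_document("Mocking a piano player")
print(counts["mocking"])  # -> 2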

Jørn Schou-Rode
+3  A: 

I don't understand why you feel a database would not be a suitable solution. You will probably only have about 100,000 rows, and a table that small can be stored entirely in memory. Make the word the primary key and lookups will be very fast.
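
For illustration, here is roughly what that could look like with SQLite from Python. The table and column names are made up, and the upsert syntax requires SQLite 3.24 or newer:

import sqlite3

db = sqlite3.connect("wordcounts.db")
db.execute("""CREATE TABLE IF NOT EXISTS word_counts (
                  word  TEXT PRIMARY KEY,
                  count INTEGER NOT NULL DEFAULT 0)""")

def increment(word):
    # One indexed upsert per word occurrence.
    db.execute("""INSERT INTO word_counts (word, count) VALUES (?, 1)
                  ON CONFLICT(word) DO UPDATE SET count = count + 1""",
               (word,))
    db.commit()

def lookup(word):
    # Primary-key lookup, so this is a single index probe.
    row = db.execute("SELECT count FROM word_counts WHERE word = ?",
                     (word,)).fetchone()
    return row[0] if row else 0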

Mark Byers
+1 This solution is easy to implement, automatically takes care of concurrency and caching, and probably performs well. I would replace it with something else only after proving that it is not efficient enough for my needs.
Eyal Schneider
+1  A: 

Use a hash table.

BlueRaja - Danny Pflughoeft
+1  A: 

Your solution sounds fine. If the cache is based on recent usage, then it will hold the counts for the most frequent words. (Word distribution is heavily skewed: something like the first 100 words cover 90% of word instances.) So you don't need a very large cache; a sketch of such a cache follows.
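
A rough sketch of a small least-recently-used cache in front of the database. get_count_from_db is a hypothetical placeholder for your actual query, and a real version would also need to refresh entries when the stored counts change:

from collections import OrderedDict

CACHE_SIZE = 1000        # a modest bound already covers the frequent words
cache = OrderedDict()

def cached_count(word):
    if word in cache:
        cache.move_to_end(word)          # mark as recently used
        return cache[word]
    count = get_count_from_db(word)      # hypothetical DB lookup
    cache[word] = count
    if len(cache) > CACHE_SIZE:
        cache.popitem(last=False)        # evict the least recently used
    return count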

If you want to improve performance and drop the db, you can encode the words as a trie and store the usage counts in the nodes where words end. In essence, that's what the database is doing if you index on the word text, so you are really only avoiding the db latency. If that is the goal, then there are other ways of avoiding db latency, such as using parallel lookups. A bare-bones trie is sketched below.
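
A minimal version of such a trie, sketched in Python with one dict per node and the count stored under a sentinel key at the node where a word ends:

END = "$"  # sentinel key meaning "a word ends at this node"
root = {}

def trie_increment(word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = node.get(END, 0) + 1

def trie_count(word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return 0
    return node.get(END, 0)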

mdma
+2  A: 

If performance is your main goal, you could use a hash-based or trie-based structure in RAM only. Assuming that you do some useful filtering anyway (so that terms with non-word characters are not counted), the maximum number of words in your table will be in the range of 10⁶ to 10⁷ (even if multiple languages are involved), so it will easily fit into the memory of a current PC (and completely avoids all the database handling).

On the other hand, if you implement the hash table details yourself, there is just more code that you can get wrong (while the database developers have hopefully tweaked their code to the maximum). So even minor details in your own implementation might lead to performance loss again.

So this dilemma clearly shows us the first and second rules of optimization: 1. Don't optimize prematurely. 2. Measure before you optimize.

:)

Bananeweizen