views:

101

answers:

5

hi,

Goal: to find count of all words in a file. file contains 1000+ words

My approach: use a HashMap() to store and count the number of times each word appears in the file.

Question: Would a HashMap() be the best way or would it be better to use a Binary Tree for ensuring faster lookup as there is a large count of words in the file?

Or is there a better way to do this?

HashMap would result in a lot of memory overhead which is not desired.

Thanks

+4  A: 

1000 - 10000 words is very small.

A Hashmap will be fine.

Mitch Wheat
A: 

A HashMap is perfect. You need to store

  • a copy of each word encountered
  • the count for each

A HashMap really won't store much more than that!

HenryTaylor
+4  A: 

So you're looking for distinct words?

The most efficient structure I can think of is a Trie

Here's one open source implementation: Google Code patricia-trie

Although I tend to agree with Mitch Wheat -- It sounds like a HashMap should work fine (It's always best to avoid premature optimization... so you should use a HashMap until you've shown that it's a bottleneck)

Michael D
+1 for beating me to the trie
Lord Torgamus
thanks for all the help!!! you guys are the best!
jillika iyer
A: 
  1. Assuming that the strings are not insanely long, a "Trie" approach as Michael suggest would be good. The node in the Trie can store the character and the "count" of the strings that end with that character. This should drastically reduce the storage requirements (again assuming the strings are uniformly distributed and overlapping)

  2. Assuming that the counts are not to be persisted across invocations, while using a HashMap, let the Map be from Integer => Integer - where the "key" is the hashcode of the string and value the count. This should be a efficient solution - with fast lookup and reduced memory foot print.

madhurtanwani
A: 

I would recommend doing such a task in Perl/PHP. It's very hard to kill a fly with a machine gun.

Noam