Counting repeated words in a file

views:

101

answers:

+2 Q:

Counting repeated words in a file

hi,

Goal: to find count of all words in a file. file contains 1000+ words

My approach: use a HashMap() to store and count the number of times each word appears in the file.

Question: Would a HashMap() be the best way or would it be better to use a Binary Tree for ensuring faster lookup as there is a large count of words in the file?

Or is there a better way to do this?

HashMap would result in a lot of memory overhead which is not desired.

Thanks

+4 A:

1000 - 10000 words is very small.

A Hashmap will be fine.

Mitch Wheat 2010-10-15 13:06:21

A HashMap is perfect. You need to store

a copy of each word encountered
the count for each

A HashMap really won't store much more than that!

HenryTaylor 2010-10-15 13:08:46

+4 A:

So you're looking for distinct words?

The most efficient structure I can think of is a Trie

Here's one open source implementation: Google Code patricia-trie

Although I tend to agree with Mitch Wheat -- It sounds like a HashMap should work fine (It's always best to avoid premature optimization... so you should use a HashMap until you've shown that it's a bottleneck)

Michael D 2010-10-15 13:08:49

+1 for beating me to the trie

Lord Torgamus 2010-10-15 13:10:35

thanks for all the help!!! you guys are the best!

jillika iyer 2010-10-15 13:20:33

Assuming that the strings are not insanely long, a "Trie" approach as Michael suggest would be good. The node in the Trie can store the character and the "count" of the strings that end with that character. This should drastically reduce the storage requirements (again assuming the strings are uniformly distributed and overlapping)
Assuming that the counts are not to be persisted across invocations, while using a HashMap, let the Map be from Integer => Integer - where the "key" is the hashcode of the string and value the count. This should be a efficient solution - with fast lookup and reduced memory foot print.

madhurtanwani 2010-10-15 13:39:27

I would recommend doing such a task in Perl/PHP. It's very hard to kill a fly with a machine gun.

Noam 2010-10-15 13:45:36

ansaurus

tags:

views:

answers:

Counting repeated words in a file

related questions