174 views · 9 answers
(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)

I have a flat file of random (English) words, one per line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. It's stored on permanent media with slow read speeds, so I need to read through it just once, linearly.

My two off-the-top-of-my-head ideas were to use a hash mapping each word to its number of occurrences, or a trie with the count stored at the terminal node. I have enough RAM for a hash table, but I'm thinking that a trie would have as-fast or faster lookups.

What approach would be best?

+2  A: 

I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
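In Python terms (function and variable names are illustrative, not from the answer), that add-or-increment pattern looks like:

```python
def count_words(words):
    """Map each lowercased word to its number of occurrences."""
    counts = {}
    for word in words:
        key = word.lower()
        if key in counts:
            counts[key] += 1    # seen before: increment
        else:
            counts[key] = 1     # first occurrence: add with value 1
    return counts

print(count_words(["The", "the", "cat"]))  # {'the': 2, 'cat': 1}
```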

Robert H.
Assuming he's using .NET, yes, that would be good.
C. Ross
Isn't Dictionary just a hash table? I'm not using .NET btw.
erjiang
A: 

A simple Python script:

import collections

counts = collections.defaultdict(int)
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1

print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))
zzzeek
OP is asking for an algorithm and you gave him code?
TStamper
Ah, right. If it were Python, using the built-in hash-based dict would be faster than a trie, since it's native code.
zzzeek
also hash is O(1) anyway...
zzzeek
@zzzeek: Hash table look-ups are O(1); getting the hash digest of a word is O(n), where n is the length of the word.
Ben S
You can replace `lambda: 0` by `int` (because `int()` will return 0 as well).
Stephan202
Wow, never knew that!
zzzeek
A: 

I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)

JSBangs
+1  A: 

I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.

How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.

If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
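A minimal harness for that kind of benchmark (all names hypothetical): time a plain-dict count over a stand-in sample, and swap in any alternative strategy as another function:

```python
import random
import string
import time

def dict_count(words):
    """Baseline strategy: plain dict of word -> count."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def bench(count_fn, words, repeats=3):
    """Return the best wall-clock time for count_fn over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        count_fn(words)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for a subset of the real data file.
sample = [''.join(random.choices(string.ascii_lowercase, k=5))
          for _ in range(10_000)]
print("dict: %.4fs" % bench(dict_count, sample))
```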

Jamie Macey
A: 

Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.

Jerry Coffin
+2  A: 

A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.

With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.

Aric TenEyck
Depending on the hashing algorithm, a hash could also be O(n). Or better, or worse.
Annabelle
A decent hash algorithm is always O(n), where `n` is the length of the word, so the big-O complexity is basically the same, just happening in different parts of the algorithm.
Jerry Coffin
Like they said, you'd have to at least scan through the word linearly to hash it, or traverse the trie with it, so there's an O(n) operation either way.
erjiang
The more I think about it, the more I realize that yeah, a hash algorithm will also be O(n). However, a good hash algorithm on a modern processor can handle 64 bits of the key at a time, while a trie will handle one or maybe two bytes of the key at a time.
Aric TenEyck
+1  A: 

I think a trie with the count as the leaves could be faster.

Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.

A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.

For example, if you've read the characters "torto", a trie would know that the only possible word that starts this way is "tortoise".

If you can perform this inline searching on a word faster than the hashing algorithm can hash it, you should come out ahead.
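A minimal counting-trie sketch along those lines (a plain dict-of-children per node; all names are illustrative), where the count is bumped as the last character is consumed:

```python
class TrieNode:
    """One node per character; count > 0 marks the end of a word."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def add_word(root, word):
    """Walk/extend the trie one character at a time, as if streaming input."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def lookup(root, word):
    """Return the stored count for word, or 0 if absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

root = TrieNode()
for w in ["tortoise", "tortoise", "torte"]:
    add_word(root, w)

print(lookup(root, "tortoise"))  # 2
```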

However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.

Ben S
A: 

Use Python!

Add each word to a set as you go, line by line, before asking whether it's in the hash table. If the word is already in the set, add it to the dictionary with a value of 2, since you already saw it once before.

This takes some of the memory and computation away from querying the dictionary for every single word, and handles unique words better: at the end, dump every word that is in the set but not in the dictionary with a count of 1. (That is, take the set-difference of the two collections.)
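One way to sketch that set-then-dictionary scheme (names are illustrative):

```python
def count_with_set(words):
    """Unique words live only in the set; the dict tracks repeats."""
    seen = set()
    counts = {}
    for word in words:
        if word not in seen:
            seen.add(word)            # first occurrence: set only
        elif word not in counts:
            counts[word] = 2          # second occurrence: enters the dict
        else:
            counts[word] += 1         # third and later occurrences
    # Words seen exactly once are in the set but not the dict.
    for word in seen - counts.keys():
        counts[word] = 1
    return counts

print(sorted(count_with_set(["a", "b", "a", "a", "c"]).items()))
# [('a', 3), ('b', 1), ('c', 1)]
```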

mduvall
A: 

To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?

Dan Blanchard