174 views · 9 answers
(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)

I have a flat file of random (English) words, one per line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. It's stored on permanent media with slow read speeds, so I need to read through it just once, linearly.

My two off-the-top-of-my-head ideas were to use a hash mapping each word to its number of occurrences, or a trie with the count stored at the terminal node. I have enough RAM for a hash table, but I'm thinking that a trie would have as-fast or faster lookups.

What approach would be best?

+2  A: 

I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
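In Python terms (function and variable names are illustrative, not from the answer), that add-or-increment pattern looks like:

```python
def count_words(words):
    """Map each lowercased word to its number of occurrences."""
    counts = {}
    for word in words:
        key = word.lower()
        if key in counts:
            counts[key] += 1    # seen before: increment
        else:
            counts[key] = 1     # first occurrence: add with value 1
    return counts

print(count_words(["The", "the", "cat"]))  # {'the': 2, 'cat': 1}
```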

Robert H.
Assuming he's using .NET, yes, that would be good.
C. Ross
Isn't Dictionary just a hash table? I'm not using .NET btw.
erjiang
A: 

A simple Python script:

import collections

counts = collections.defaultdict(int)
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1

print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))
zzzeek
OP is asking for an algorithm and you gave him code?
TStamper
Ah, right. If it were Python, using the built-in hash-based dict would be faster than a trie, since it's native code.
zzzeek
also hash is O(1) anyway...
zzzeek
@zzzeek: Hash table look-ups are O(1); getting the hash digest of a word is O(n), where n is the length of the word.
Ben S
You can replace `lambda: 0` by `int` (because `int()` will return 0 as well).
Stephan202
Wow, never knew that!
zzzeek
A: 

I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)

JSBangs
+1  A: 

I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.

How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.

If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
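A minimal harness for that kind of benchmark (all names hypothetical): time a plain-dict count over a stand-in sample, and swap in any alternative strategy as another function:

```python
import random
import string
import time

def dict_count(words):
    """Baseline strategy: plain dict of word -> count."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def bench(count_fn, words, repeats=3):
    """Return the best wall-clock time for count_fn over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        count_fn(words)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for a subset of the real data file.
sample = [''.join(random.choices(string.ascii_lowercase, k=5))
          for _ in range(10_000)]
print("dict: %.4fs" % bench(dict_count, sample))
```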

Jamie Macey
A: 

Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.

Jerry Coffin
+2  A: 

A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.

With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.

Aric TenEyck
Depending on the hashing algorithm, a hash could also be O(n). Or better, or worse.
Annabelle
A decent hash algorithm is always O(n), where `n` is the length of the word, so the big-O complexity is basically the same, just happening in different parts of the algorithm.
Jerry Coffin
Like they said, you'd have to at least scan through the word linearly to hash it, or traverse the trie with it, so there's an O(n) operation either way.
erjiang
The more I think about it, the more I realize that yeah, a hash algorithm will also be O(n). However, a good hash algorithm on a modern processor can handle 64 bits of the key at a time, while a trie will handle one or maybe two bytes of the key at a time.
Aric TenEyck
+1  A: 

I think a trie with the count as the leaves could be faster.

Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.

A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.

For example, if you've read the characters "torto", a trie would know that the only possible word that starts this way is "tortoise".

If you can perform this inline searching on a word faster than the hashing algorithm can hash it, you should come out ahead.
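A minimal counting-trie sketch along those lines (a plain dict-of-children per node; all names are illustrative), where the count is bumped as the last character is consumed:

```python
class TrieNode:
    """One node per character; count > 0 marks the end of a word."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def add_word(root, word):
    """Walk/extend the trie one character at a time, as if streaming input."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def lookup(root, word):
    """Return the stored count for word, or 0 if absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

root = TrieNode()
for w in ["tortoise", "tortoise", "torte"]:
    add_word(root, w)

print(lookup(root, "tortoise"))  # 2
```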

However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.

Ben S
A: 

Use Python!

Add each word to a set as you go, line by line, before asking whether it's in the hash table. If the word is already in the set, add it to the dictionary with a value of 2, since you already saw it once before.

This takes some of the memory and computation away from querying the dictionary for every single word, and handles unique words better: at the end, dump every word that is in the set but not in the dictionary with a count of 1. (That is, take the set-difference of the two collections.)
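One way to sketch that set-then-dictionary scheme (names are illustrative):

```python
def count_with_set(words):
    """Unique words live only in the set; the dict tracks repeats."""
    seen = set()
    counts = {}
    for word in words:
        if word not in seen:
            seen.add(word)            # first occurrence: set only
        elif word not in counts:
            counts[word] = 2          # second occurrence: enters the dict
        else:
            counts[word] += 1         # third and later occurrences
    # Words seen exactly once are in the set but not the dict.
    for word in seen - counts.keys():
        counts[word] = 1
    return counts

print(sorted(count_with_set(["a", "b", "a", "a", "c"]).items()))
# [('a', 3), ('b', 1), ('c', 1)]
```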

mduvall
A: 

To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?

Dan Blanchard