tags:

views:

69

answers:

4

I have to read words from a 10 G file and put them in a sorted manner of their frequency, how can I achieve this in most efficient way?

+1  A: 

Hadoop WordCount example

MikeG
+1  A: 

Use a database.

Otherwise you'll just end up creating a subset of a database anyway.

Richard
are you serious? this is completely algorithmic question that can be solved using code and memory, why to use database?
Andrey
@Andrey: Because (1) databases are designed to do this kind of thing (10G is not a big database), (2) if that is 10G of unique words then handling a hash of billions of entries doesn't have to be solved (i.e. building an indexed disk file of some type aka a database).
Richard
from (1) it is necessary concluded that you should use it. passing 10GB over network will take significant amount of time and other users on network will not appreciate it. (2) who said unique? If it is English, then there are maximum 1M words (http://hypertextbook.com/facts/2001/JohnnyLing.shtml) not serious amount for hash. for me task feels as easy as it sounds, but your solution is overengineered for it.
Andrey
@Andrey who said anything about network (I would certainly not cross the network without thought -- but see Hadoop answer). Who said English? 10G of text for this problem is of the order of the Library of Congress => unlikely to be human language, but could be something like genetic data which certainly could be all unique... or sufficiently few duplicates, even 1000000000 entries (average count 10) will cause memory issues on common platforms.
Richard
Looks like a misremembered the Library of Congress (should be 10TB... only out by 1000x). Never the less it is an enormous amount of text.
Richard
+3  A: 

create a Hash that will map Word -> # of occurrences. Then populate it. After that convert to array and sort.

Andrey
I can use TreeMap to keep a count on word, how about reading the file?
Vishal
@Vishal TreeMap might not be best option. "The map is sorted according to the natural ordering of its keys" (http://download.oracle.com/javase/6/docs/api/java/util/TreeMap.html) in your case order of keys (words) is irrelevant. order of values is important, but it is not stable because it changes over time.
Andrey
@Vishal you read file char by char, when you see end of work (space or punctuation sign) you consider word finished, convert it to lower/upper case and increment frequency (depends on algorithm you pick)
Andrey
+4  A: 

I would use a Trie

dty