I have to read words from a 10 G file and put them in a sorted manner of their frequency, how can I achieve this in most efficient way?
views:
69answers:
4
+1
A:
Use a database.
Otherwise you'll just end up creating a subset of a database anyway.
Richard
2010-10-14 15:51:24
are you serious? this is completely algorithmic question that can be solved using code and memory, why to use database?
Andrey
2010-10-14 15:51:54
@Andrey: Because (1) databases are designed to do this kind of thing (10G is not a big database), (2) if that is 10G of unique words then handling a hash of billions of entries doesn't have to be solved (i.e. building an indexed disk file of some type aka a database).
Richard
2010-10-14 15:58:30
from (1) it is necessary concluded that you should use it. passing 10GB over network will take significant amount of time and other users on network will not appreciate it. (2) who said unique? If it is English, then there are maximum 1M words (http://hypertextbook.com/facts/2001/JohnnyLing.shtml) not serious amount for hash. for me task feels as easy as it sounds, but your solution is overengineered for it.
Andrey
2010-10-14 16:02:55
@Andrey who said anything about network (I would certainly not cross the network without thought -- but see Hadoop answer). Who said English? 10G of text for this problem is of the order of the Library of Congress => unlikely to be human language, but could be something like genetic data which certainly could be all unique... or sufficiently few duplicates, even 1000000000 entries (average count 10) will cause memory issues on common platforms.
Richard
2010-10-14 16:13:37
Looks like a misremembered the Library of Congress (should be 10TB... only out by 1000x). Never the less it is an enormous amount of text.
Richard
2010-10-14 16:19:06
+3
A:
create a Hash that will map Word -> # of occurrences. Then populate it. After that convert to array and sort.
Andrey
2010-10-14 15:51:34
@Vishal TreeMap might not be best option. "The map is sorted according to the natural ordering of its keys" (http://download.oracle.com/javase/6/docs/api/java/util/TreeMap.html) in your case order of keys (words) is irrelevant. order of values is important, but it is not stable because it changes over time.
Andrey
2010-10-14 16:05:39
@Vishal you read file char by char, when you see end of work (space or punctuation sign) you consider word finished, convert it to lower/upper case and increment frequency (depends on algorithm you pick)
Andrey
2010-10-14 16:06:35