How can I efficiently read words from a 10 G file and put them in order of their frequency?

tags:

file-io
java

views:

answers:

+1 Q:

How can I efficiently read words from a 10 G file and put them in order of their frequency?

I have to read words from a 10 G file and put them in a sorted manner of their frequency, how can I achieve this in most efficient way?

+1 A:

Hadoop WordCount example

MikeG 2010-10-14 15:51:14

+1 A:

Use a database.

Otherwise you'll just end up creating a subset of a database anyway.

Richard 2010-10-14 15:51:24

are you serious? this is completely algorithmic question that can be solved using code and memory, why to use database?

Andrey 2010-10-14 15:51:54

@Andrey: Because (1) databases are designed to do this kind of thing (10G is not a big database), (2) if that is 10G of unique words then handling a hash of billions of entries doesn't have to be solved (i.e. building an indexed disk file of some type aka a database).

Richard 2010-10-14 15:58:30

from (1) it is necessary concluded that you should use it. passing 10GB over network will take significant amount of time and other users on network will not appreciate it. (2) who said unique? If it is English, then there are maximum 1M words (http://hypertextbook.com/facts/2001/JohnnyLing.shtml) not serious amount for hash. for me task feels as easy as it sounds, but your solution is overengineered for it.

Andrey 2010-10-14 16:02:55

@Andrey who said anything about network (I would certainly not cross the network without thought -- but see Hadoop answer). Who said English? 10G of text for this problem is of the order of the Library of Congress => unlikely to be human language, but could be something like genetic data which certainly could be all unique... or sufficiently few duplicates, even 1000000000 entries (average count 10) will cause memory issues on common platforms.

Richard 2010-10-14 16:13:37

Looks like a misremembered the Library of Congress (should be 10TB... only out by 1000x). Never the less it is an enormous amount of text.

Richard 2010-10-14 16:19:06

+3 A:

create a Hash that will map Word -> # of occurrences. Then populate it. After that convert to array and sort.

Andrey 2010-10-14 15:51:34

I can use TreeMap to keep a count on word, how about reading the file?

Vishal 2010-10-14 16:01:43

@Vishal TreeMap might not be best option. "The map is sorted according to the natural ordering of its keys" (http://download.oracle.com/javase/6/docs/api/java/util/TreeMap.html) in your case order of keys (words) is irrelevant. order of values is important, but it is not stable because it changes over time.

Andrey 2010-10-14 16:05:39

@Vishal you read file char by char, when you see end of work (space or punctuation sign) you consider word finished, convert it to lower/upper case and increment frequency (depends on algorithm you pick)

Andrey 2010-10-14 16:06:35

+4 A:

I would use a Trie

dty 2010-10-14 16:14:05

ansaurus

tags:

views:

answers:

How can I efficiently read words from a 10 G file and put them in order of their frequency?

related questions