Sorting a very large text file in Java

views:

567

answers:

+1 Q:

Sorting a very large text file in Java

Hi, I have a large text file I need to sort in Java. The format is:

word [tab] frequency [new line]

The algorithm for sorting is:

Read some of the file, filtering for purlely alphabetic words.
Once you have X number of alphabetic words, call Collections.sort and write the result to a file.
Repeat until you have finished reading the file.
Start reading two sorted files, comparing line by line for the word with higher frequency, and writing at the same time to a new file as to not load much into your memory
Repeat until all files are merged into one large file

Right now I've divided the large file into smaller ones (sorted by descending frequency) with 10,000 lines each. I know I need to somehow merge these files back together, but I'm not sure how to go about this.

I've created a LinkedList to keep track of all the files created. The algorithm says to compare each line in the two files, except I've tried a case where , say file1 = 8,6,5,3,1 and file2 = 9,8,8,8,8. Then if I compare them line by line I would get file3 = 9,8,8,6,8,5,8,3,8,1 which is incorrectly sorted (they should be in decreasing order).

I think I'm misunderstanding some part of the algorithm. If someone could point out what I should do instead, I'd greatly appreciate it. Thanks.

edit: Yes this is an assignment. We aren't allowed to increase memory unfortunately :(

+3 A:

You have the right idea, but with a small error. As you read the lines from the 2 files, you shouldn't be outputting both lines, because the next line in the file with the greater number might still be greater than the first line in the file with the smaller number (as it is in your test case).

So, it's quite simply this:

Read a line from each file to start.
Then repeat this:
.The line with the highest value is written to a new file
.Read another line from that file only

This is the basic algorithm, but of course you have to allow for what happens when one of the files runs out (in which case you just read lines and output from the remaining file - whether this is a separate loop or part of the same loop is up to you - I would look at what the code looks like before making that decision).

Tony van der Peet 2009-12-05 20:06:17

+1 on Tony's response. The algorithm you need to follow is the "Merge" portion of a Merge Sort (http://en.wikipedia.org/wiki/Merge_sort) where you merge two sorted arrays/lists.

djunforgetable 2009-12-05 21:20:08

Ohh ok, I think I know what needs to be done now. Thanks!

Alice 2009-12-05 22:01:16

Actually, I think you are going to encounter another problem at a later stage of answering this question. That is when you have the same word in two different files with different frequencies. This is going to be a slightly harder one to answer...

Tony van der Peet 2009-12-05 22:54:54

Knuth, volume 2, "Sorting and Searching".

Loadmaster 2009-12-05 20:21:34

If the file is too large to fit into memory, use a database. Something like MySQL might be too heavyweight, but there are embeddable databases you can use in java.

One of them is berkely DB which is a Key/value database system.

Apache Derby is a relational database system that lets you use SQL.

If you already know SQL, derby might be the easiest way to go. I haven't used it myself.

Chad Okere 2009-12-05 21:09:51

ansaurus

tags:

views:

answers:

Sorting a very large text file in Java

related questions