Can anyone recommend a fast way to sort the contents of a text file, based on the first X characters of each line? For example, say I have the following text in the file:

Adrian Graham   some more text here

John Adams     some more text here

Then another record needs to be inserted, e.g.

Bob Something some more text here

I need to keep the file sorted, but it is a rather big file and I'd rather not load it entirely into memory at once. By big I mean about 500,000 lines, so perhaps not terribly huge.

I've searched around and found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208 and I wanted to know whether anyone could suggest other ways, for the sake of a second opinion.
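As I understand it, the approach in that article is an external merge sort: split the file into chunks small enough to sort in memory, sort each chunk and write it to a temporary file, then merge the sorted chunks. A rough sketch of that idea (my own sketch, not the article's code; the file names and chunk size are placeholders):

    import java.io.*;
    import java.util.*;

    public class ExternalSort {

        // Lines per chunk is a guess at what comfortably fits in memory.
        private static final int CHUNK_LINES = 100000;

        public static void main(String[] args) throws IOException {
            List<File> chunks = splitIntoSortedChunks(new File("big.txt"));
            mergeChunks(chunks, new File("big-sorted.txt"));
        }

        // Pass 1: read the input in chunks, sort each chunk in memory,
        // and write it out to its own temporary file.
        static List<File> splitIntoSortedChunks(File input) throws IOException {
            List<File> chunks = new ArrayList<File>();
            BufferedReader in = new BufferedReader(new FileReader(input));
            try {
                List<String> lines = new ArrayList<String>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    lines.add(line);
                    if (lines.size() == CHUNK_LINES) {
                        chunks.add(writeSortedChunk(lines));
                        lines.clear();
                    }
                }
                if (!lines.isEmpty()) {
                    chunks.add(writeSortedChunk(lines));
                }
            } finally {
                in.close();
            }
            return chunks;
        }

        static File writeSortedChunk(List<String> lines) throws IOException {
            Collections.sort(lines); // plain lexicographic sort of one chunk
            File chunk = File.createTempFile("chunk", ".txt");
            PrintWriter out = new PrintWriter(new FileWriter(chunk));
            for (String s : lines) {
                out.println(s);
            }
            out.close();
            return chunk;
        }

        // Pass 2: k-way merge. Keep one open reader per chunk in a priority
        // queue ordered by each reader's current line; repeatedly write the
        // smallest line and advance that reader.
        static void mergeChunks(List<File> chunks, File output) throws IOException {
            PriorityQueue<ChunkReader> heap = new PriorityQueue<ChunkReader>();
            for (File f : chunks) {
                ChunkReader c = new ChunkReader(f);
                if (c.line != null) {
                    heap.add(c);
                } else {
                    c.close();
                }
            }
            PrintWriter out = new PrintWriter(new FileWriter(output));
            while (!heap.isEmpty()) {
                ChunkReader c = heap.poll();
                out.println(c.line);
                if (c.advance()) {
                    heap.add(c); // still has lines; re-queue under its new head
                } else {
                    c.close();
                }
            }
            out.close();
        }

        static class ChunkReader implements Comparable<ChunkReader> {
            final BufferedReader reader;
            String line; // the reader's current, not-yet-written line

            ChunkReader(File f) throws IOException {
                reader = new BufferedReader(new FileReader(f));
                line = reader.readLine();
            }

            boolean advance() throws IOException {
                line = reader.readLine();
                return line != null;
            }

            void close() throws IOException {
                reader.close();
            }

            public int compareTo(ChunkReader other) {
                return line.compareTo(other.line);
            }
        }
    }

The appeal of the merge pass is that it only ever holds one line per chunk in memory, which is what makes it work for files that don't fit in RAM.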

My initial idea, before I read the article linked above, was:

Read the file

Split it into several files, e.g. A to Z

If a line begins with "A", it is written to the file called A.txt

Each file then has its contents sorted (no clear idea how just yet, apart from alphabetically)

Then, when it comes to reading data, I know that if I want to find a line that starts with "A", I open A.txt. The same applies when inserting a new line: I just append it to the end of the relevant file. Later, after the insert, when there is time, I can invoke my sorting program to reorder the files that have had lines appended to them.
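A rough sketch of the splitting step (file names are placeholders; anything that doesn't start with a letter goes into one extra bucket):

    import java.io.*;
    import java.util.*;

    public class BucketSplit {
        public static void main(String[] args) throws IOException {
            // One writer per bucket file: A.txt .. Z.txt, plus other.txt for
            // lines that don't start with a letter. Names are placeholders.
            Map<String, PrintWriter> buckets = new HashMap<String, PrintWriter>();
            BufferedReader in = new BufferedReader(new FileReader("records.txt"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.length() == 0) {
                        continue;
                    }
                    char first = Character.toUpperCase(line.charAt(0));
                    String name = (first >= 'A' && first <= 'Z')
                            ? first + ".txt" : "other.txt";
                    PrintWriter out = buckets.get(name);
                    if (out == null) {
                        // append mode, so the same code works for later inserts
                        out = new PrintWriter(new FileWriter(name, true));
                        buckets.put(name, out);
                    }
                    out.println(line);
                }
            } finally {
                in.close();
                for (PrintWriter w : buckets.values()) {
                    w.close();
                }
            }
        }
    }

Each bucket would then be sorted on its own, in memory if it's small enough.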

I realise there are a few flaws in this; for example, lines won't be evenly distributed across the letters, so some files may end up much bigger than others.

Which, again, is why I need a second opinion and suggestions on how to approach this. The current program is in Java, but any programming language could be used for an example; I'll port what I need.

(If anyone's wondering, I'm not deliberately trying to give myself a headache by storing info this way; I inherited a painful little program which stores data in files instead of using some kind of database.) Thanks in advance.

+1  A: 

500,000 lines shouldn't really be that much to sort. Read the whole thing into memory and then sort it using the standard built-in functions. If you really find that this is too slow, move on to something more complicated. 500,000 lines × about 60 bytes per line still only comes to about 30 MB.
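Something along these lines should do it (a minimal sketch; it assumes plain lexicographic order is what you want, and "big.txt" is a placeholder file name):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Collections;
    import java.util.List;

    public class InMemorySort {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get("big.txt");              // placeholder name
            List<String> lines = Files.readAllLines(file); // whole file in memory
            Collections.sort(lines); // lexicographic, so it orders by the
                                     // leading characters of each line
            Files.write(file, lines);                      // write it back, sorted
        }
    }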

Kibbee
That's true, but it's a growing file and I wanted to take future sizes into account.
robinsonc494
Perhaps it's premature optimization to try to solve a problem that doesn't exist. It's probably better to monitor the application, wait to see whether speed and/or memory usage becomes a problem, and fix it then.
Kibbee
Point taken on the premature optimization, but the fact still remains that I need to at least sort the data that already exists.
robinsonc494
+2  A: 

You may also want to simply call the DOS "sort" command to sort the file. It is quick and will require next to no programming on your part.

In a DOS box, type help sort|more for the sort syntax and options.
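If you end up driving it from the Java program itself, a rough sketch using ProcessBuilder (the command and file names are placeholders; whether plain "sort" resolves depends on what's on the PATH):

    import java.io.File;
    import java.io.IOException;

    public class ShellOutSort {
        public static void main(String[] args) throws IOException, InterruptedException {
            // "sort" is the external command (DOS sort on Windows, GNU sort on
            // Linux); input.txt and sorted.txt are placeholder file names.
            ProcessBuilder pb = new ProcessBuilder("sort", "input.txt");
            pb.redirectOutput(new File("sorted.txt")); // send stdout to a file
            pb.redirectErrorStream(true);              // fold stderr into stdout
            int exit = pb.start().waitFor();
            if (exit != 0) {
                System.err.println("sort failed with exit code " + exit);
            }
        }
    }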

BoltBait
That'd be a nice trick, but it's run on an Ubuntu desktop... although my dev environment is Windows :)
robinsonc494
@robinsonc494, I'm no expert in Ubuntu, but doesn't it have a similar command?
BoltBait
It probably does; I'll look into it, thanks. I don't use Linux that often as a desktop environment, so I'm no expert either, but I'll check it out in a minute and see whether it has such a thing. Thanks again.
robinsonc494
A: 

Another option might be to read the file and put it into a lightweight DB (for example, HSQLDB in file mode).

Then get the data back sorted and write it to a file. (Or simply migrate the program so that it uses a DB.)
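A rough sketch of the idea with plain JDBC (the hsqldb jar has to be on the classpath; the database path, table name, and input file are placeholders, and SA with an empty password is HSQLDB's default account):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.*;

    public class DbSort {
        public static void main(String[] args) throws Exception {
            // "jdbc:hsqldb:file:recordsdb" opens a file-mode database;
            // recordsdb is a placeholder path. With an older driver you may
            // need Class.forName("org.hsqldb.jdbcDriver") first.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:file:recordsdb", "SA", "");
            Statement st = conn.createStatement();
            st.execute("CREATE TABLE records (line VARCHAR(1024))");

            // Load the file, one row per line.
            PreparedStatement ins =
                    conn.prepareStatement("INSERT INTO records VALUES (?)");
            BufferedReader in = new BufferedReader(new FileReader("records.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                ins.setString(1, line);
                ins.executeUpdate();
            }
            in.close();

            // Let the database do the sorting.
            ResultSet rs = st.executeQuery("SELECT line FROM records ORDER BY line");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }

            st.execute("SHUTDOWN"); // clean shutdown for HSQLDB file mode
            conn.close();
        }
    }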

davyM
I might give this a shot, although my last encounter with HSQLDB wasn't too pleasant.
robinsonc494