views:

440

answers:

6

Hi

I have a 5 GB text file that needs to be sorted in alphabetical order. What is the best algorithm to use?

constraints:

Speed - As fast as possible

Memory - A PC with 1 GB RAM running Windows XP

A: 

Merge Sort is your best bet.

No Refunds No Returns
+1  A: 

I would say take a smaller subset of the data and try a few approaches to see which works best, then go with that. This article might help you get started.

John Biesnecker
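For instance, a test sample can be carved off the top of the file quickly (the line count here is an arbitrary choice):

    # Grab the first million lines as a quick test sample.
    head -n 1000000 big.txt > sample.txt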
+1  A: 

What are the parameters of the sort? Do you have time constraints or space constraints? How close to ordered is the file already? Do you have to do it in one pass?

GrayWizardx
+9  A: 

I routinely sort text files >2GB with the Linux sort command. It usually takes 15-30 seconds, depending on server load.

Just do it, it won't take as long as you think.

Update: Since you're using Windows XP, you can get the sort command in UnxUtils. I use that one probably more often than the Linux version, and it's just as fast.

The bottleneck for huge files is really disk speed: the server above has a fast SATA RAID. If your machine is a desktop (or laptop), then its 7200 RPM (or 5400 RPM) IDE drive will add a few minutes to the job.

Seth
But be very careful to avoid treating the data as UTF-8 if you don't need that capability: in modern Linux (and Cygwin) versions of sort, UTF-8 comparisons slow performance down by about 100 times.
Adrian Pronk
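To make that concrete, a typical GNU sort invocation might look like the following; the buffer size, temp directory, and file names are placeholder assumptions, and LC_ALL=C forces the fast byte-wise comparison mentioned above:

    # C locale = plain byte-wise comparison; -S caps the memory buffer, -T picks the temp directory.
    LC_ALL=C sort -S 512M -T /path/to/big/tmp -o big_sorted.txt big.txt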
+4  A: 

For text files, sort (at least the GNU Coreutils version found on Linux and elsewhere) works surprisingly fast.

Take a look at the --buffer-size and related options, and set --temporary-directory if your /tmp directory is too small.

Alternatively, if you're really worried about how long it might take, you can split the file into smaller chunks, sort them individually, then merge them together (with sort --merge), as sketched below. Sorting each chunk can be done on different systems in parallel.

ZoogieZork
Any sort worth its name will be doing the splitting and sorting and merging for you anyway. With only 1 GB RAM on the target machine, a 5 GB file will be sorted using a number of intermediate files which are merged together at the end.
Jonathan Leffler
Indeed. The point of the comment was the possibility of doing it in parallel across multiple machines.
ZoogieZork
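A rough sketch of that chunk-and-merge approach with GNU tools (the chunk size and file names are arbitrary; split -C keeps whole lines together, and the per-chunk sorts could run in parallel or on separate machines):

    # 1. Split into ~500 MB pieces without breaking any line in half.
    split -C 500M big.txt chunk_
    # 2. Sort each piece on its own (these jobs are independent).
    for f in chunk_??; do LC_ALL=C sort "$f" -o "$f.sorted"; done
    # 3. Merge the already-sorted pieces into the final output.
    LC_ALL=C sort -m chunk_??.sorted -o big_sorted.txt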
A: 

How about importing the data into SQL Server using the BULK INSERT command?

This gets the data into SQL Server quite quickly and then lets you perform all manner of efficient SQL sorting on the imported data.

You can also set this up as an automated task using SQL Server SSIS.

Brian Scott