If I have a relation (SQL) that does not fit in memory and I want to sort the relation using TPMMS (Two-pass multi-way merge sort method). How would I divide the table in sub tables (and how many) that can fit in memory and than merge them? Let's say I am using C#.
A:
I've not hunted down the current definition of a two-pass multi-way merge sort, but the theory of 'external sorting' (where the data is too big to fit in memory) is pretty much standard. Any decent book on algorithms will cover it; amongst many others, you could look at Knuth, Sedgewick, or (for the software archaeologists) Kernighan & Plauger Software Tools.
The basic technique is simple:
- Read data until there is no space left.
- Sort it.
- Write to a temporary file.
- Repeat from 1 until there is no data left to read.
- You know know how many temporary files you have, N.
- You need to determine how many of those files you can read at one time, M.
- If N > M, then you design your merging phase so that the last phase will merge M files.
- You merge sets of M files into new temporary files until you reach the last merge.
- You merge the final set of M files (or N if N < M) writing to final destination.
All dreadfully standard - but there are nit-picky details to get right.
There was a good article in AT&T's Unix System Readings Volume II called 'Theory and Practice in the Construction of a Working Sort Routine' that you should find and read if you are serious about learing how to handle external sorts.
Jonathan Leffler
2009-05-21 15:59:15