The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.
One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of files you create during your initial sorting, and maximize the length of each. To do that, you generally read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you put it into a heap. Then, each time you write the smallest record from the heap out to the current file, you read another record in. If the new record is greater than or equal to the one you just wrote, it can still be part of the current file's sorted run, so you put it into the heap. If it's smaller, it has to wait for the next file, so you put it into a second heap instead. The current file ends when the first heap is completely empty and the second heap is taking up all your memory. At that point, you start putting records into a new file, and basically "swap" the roles of the two heaps.
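To make that concrete, here's a minimal sketch of the two-heap scheme in Python (my own illustration, not anybody's production code). It assumes records compare with the ordinary comparison operators and treats "memory" as a simple record count; a real external sort would write each run to its own temporary file rather than yielding lists.

```python
import heapq

_SENTINEL = object()  # marks "input exhausted"

def replacement_selection_runs(records, memory_capacity):
    """Split `records` into sorted runs using two heaps.

    Yields each run as a list; a real external sort would write each run
    to its own temporary file instead.  `memory_capacity` is the number
    of records that fit in memory -- a stand-in for a real byte budget.
    """
    it = iter(records)

    # Fill the first heap with as much data as "memory" allows.
    current = []
    for rec in it:
        current.append(rec)
        if len(current) == memory_capacity:
            break
    heapq.heapify(current)

    pending = []   # records that have to wait for the next run
    run = []       # the run being produced right now

    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)                 # "write" the record to the current run

        rec = next(it, _SENTINEL)            # read one record in to replace it
        if rec is not _SENTINEL:
            if rec >= smallest:
                heapq.heappush(current, rec)   # still fits in the current run
            else:
                heapq.heappush(pending, rec)   # too small; save it for the next run

        if not current:                      # current run is finished
            yield run
            run = []
            current, pending = pending, []   # swap the roles of the two heaps
```

With room for only three records, for example, the input [5, 1, 9, 2, 8, 3, 7, 4, 6] comes out as two runs, [1, 2, 5, 8, 9] and [3, 4, 6, 7], instead of the three runs of three you'd get from plain sort-and-dump.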
This produces considerably longer intermediate files in the initial phase -- for randomly ordered input they average about twice the size of memory, and far longer if the input is already partially ordered -- so merging them is substantially less work.
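The merge phase on those files is then just a k-way merge of sorted inputs. Here's a small sketch using Python's heapq.merge; the function name, the one-record-per-line layout, and plain lexicographic ordering are all assumptions for illustration.

```python
import heapq

def merge_runs(run_paths, output_path):
    """K-way merge of already-sorted run files into one sorted output file.

    heapq.merge keeps only one pending line per run in memory at a time,
    so memory use stays small no matter how long the runs are.
    """
    files = [open(path) for path in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

If you end up with more files than you can hold open at once, you merge in several passes -- which is exactly why producing fewer, longer files up front pays off.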
Edit: I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth calls the technique "replacement selection" and credits its first publication to H. Seward, in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I've never gotten around to getting a copy of the third edition, so I don't have a page number for that.