views:

213

answers:

4

Hello!

I hava a file, which consists of a one row:

 1 , 1 2 , 1 3 6 , 4 ,...

In this representation, spaces separate the integers and commas. This string is so huge that I can't read it with RandomAccessFile.readLine() (almost 4 Gb needed). So that I created a buffer, which can contain 10 integers. My task is to sort all integers in the string.

Could you, please, help?

EDIT

@Oscar Reyes

I need to write some sequences of integers to a file and then to read from it. Actually I don't know, how to do it. I'm a newbie. So I decided to use chars to write integers, delimiters between integers are ",", and delimeters between sequences are "\n\r" which. So that I created a monster that reads it:

public BinaryRow getFilledBuffer(String filePath, long offset) throws IOException{
    mainFile = new RandomAccessFile(filePath, "r");

    if (mainFile.length() == 0){
        return new BinaryRow();
    }

    StringBuilder str = new StringBuilder();

    mainFile.seek(mainFile.length()-4); //that is "\n" symbol
    char chN = mainFile.readChar();

    mainFile.seek(offset);
    int i = 0;
    char nextChar = mainFile.readChar();
    while (i < 11 && nextChar != chN){
        str.append(nextChar);
        if (nextChar == ','){
            i++;
            if (i == 10){
                break;
            }
        }
        nextChar = mainFile.readChar();
    }

    if (nextChar == chN){
        position = -1;
    }else{
        position = mainFile.getFilePointer();
    }

    BinaryRow br = new BinaryRow();

    StringBuilder temp = new StringBuilder();

    for (int j = 0; j < str.length(); j++){
        if ((str.charAt(j) != ',')){
            temp.append(str.charAt(j));
            if (j == str.length() - 1){
                br.add(Integer.parseInt(temp.toString()));
            }   
        }else{
            br.add(Integer.parseInt(temp.toString()));
            temp.delete(0, temp.length());
        }
    }


    mainFile.close();
    return br;

}

If you could advise how to do it, please do it =)

+1  A: 

Read it to memory in chunks (100 MB each?), one chunk at a time, sort it and save to disk.

Then open all the ordered chunks, read the first element of each, and append the lowest to the output. Then read the next element of the chunk you just read from and repeat.

When merging you can keep an array of the last int read from each chunk and just iterate over it to get the lowest. Then you substitute the value you just used with the next element in the chunk it was taken from.

example with chunks [1, 5, 16] [2, 9, 14] [3, 8, 10]
array [(1), 2, 3], lowest 1 --> to output
      [5, (2), 3], lowest 2 --> to output
      [5, 9, (3)], lowest 3 -->
      [(5), 9, 8],        5
      [16, 9, (8)],       8
      [16, (9), 10],      9 
...
Utaal
If I am not mistaken also I will have to create some kind of index array. On the other hand, one chunk may contain integers 1, 200, 500, another 2, 100, 300 ...
Dmitry
@Dmitry: Indeed, it would be better if you implement QuickSort which uses a pivot to overcome that detail.
OscarRyz
I've added an example of the merging process
Utaal
@Oscar: if I'm not mistaken you are proposing a quicksort on the array for merging: I think that you can just iterate through the 40? elements (4 GB / 100 MB) it shouldn't degrade performance and you'll get a much simpler method to substitute the value you just used with the next on the same chunk
Utaal
The usage of memory in the merging process is *very* limited: in fact you can keep in memory just the array (its size corresponds to N integers with N = number of chunks, that translates in little more than 1 KB - as I don't know how they are stored in Java I'm assuming 32 bit integers)You certainly don't have to load the whole chunks in memory, just open all of them and read one integer at a time in the array.
Utaal
Of course, to create the temporary sorted files (which is the 1st step ) you'll have to read a chunk of the original file in memory (100MB at a time?), sort it, save to disk and proceed with the next chunk till exhaustion of source file.
Utaal
Thank you, I'll think about it
Dmitry
+12  A: 

This is exactly the origin QuickSort back then there was not enough RAM to sort in memory so they procedure is to store partial results in disk.

So what you can do is:

  1. Pick a pivot.
  2. Read sequentially your file and store data lower than pivot in temp_file_1 and data bigger or equal to the pivot in temp_file_2
  3. Repeat the procedure in temp_file_1 and append the result to result_file
  4. Repeat the procedure for temp_file_2 and append the result to result_file

When parts are small enough ( like 2 just direct swap them Enough to be sorted in memory )

This way you'll be able to sort in chunks and store the partial results in temp files and you'll have a final file with the result sorted.

EDIT I told you a quick sort was possible.

It seems like you would need some extra space for the temp files after all.

Here's what I did.

I create a 40 mb file with numbers separated by commas.

I name it input:

input

Input is 40mb

During the sort, the tmp files with the buckets of "greater than", "lower than" values are created and when the sort is finished, the values are sent to a file called ( guess what ) output

processing

Temp files are created with the partial results

Finally all the tmp files are deleted and the result is kept in the file "output" with the correct sorted sequence of numbers:

output

Finally the file "output" is created, notice it is 40 mb too

Here's the full program.

import java.io.*;
import java.util.*;

public class FileQuickSort {

    static final int MAX_SIZE = 1024*1024*16; // 16 megabytes in this sample, the more memory your program has, less disk writing will be used. 
    public static void main( String [] args ) throws IOException {
        fileQuickSort( new File("input"), new File("output"));
        System.out.println();
    }

    //
    static void fileQuickSort( File inputFile, File outputFile ) throws IOException {
        Scanner scanner = new Scanner( new BufferedInputStream( new FileInputStream( inputFile ), MAX_SIZE));
        scanner.useDelimiter(",");

        if( inputFile.length() > MAX_SIZE && scanner.hasNextInt()) {
            System.out.print("-");

            // put them in two buckets... 
            File lowerFile = File.createTempFile("quicksort-","-lower.tmp",new File("."));
            File greaterFile = File.createTempFile("quicksort-","-greater.tmp", new File("."));
            PrintStream  lower   = createPrintStream(lowerFile);
            PrintStream greater  = createPrintStream(greaterFile);
            PrintStream target = null;
            int pivot = scanner.nextInt();

            // Read the file and put the values greater than in a file 
            // and the values lower than in other 
            while( scanner.hasNextInt() ){
                int current = scanner.nextInt();

                if( current < pivot ){
                    target = lower;
                } else {
                    target = greater;
                }
                target.printf("%d,",current);
            }
            // avoid dropping the pivot
            greater.printf("%d,",pivot);
            // close the stream before reading them again
            scanner.close();
            lower.close();
            greater.close();
            // sort each part
            fileQuickSort( lowerFile , outputFile );
            lowerFile.delete();
            fileQuickSort( greaterFile   , outputFile);
            greaterFile.delete();

            // And you're done.
        } else {

            // Else , if you have enough RAM to process it
            // 
            System.out.print(".");
            List<Integer> smallFileIntegers = new ArrayList<Integer>();
            // Read it
            while( scanner.hasNextInt() ){
                smallFileIntegers.add( scanner.nextInt() );
            }
            scanner.close();

            // Sort them in memory 
            Collections.sort( smallFileIntegers );

            PrintStream out = createPrintStream( outputFile);
            for( int i : smallFileIntegers ) {
                out.printf("%d,",i);
            }
            out.close();
            // And your're done
        }
    }
    private static PrintStream createPrintStream( File file ) throws IOException {
        boolean append = true;
        return new PrintStream(  new BufferedOutputStream( new FileOutputStream( file, append )));
    }
}

The format of the files is number,number,number,number

Your current format is: n u m b e r , n u m b , b e r

To fix that you just have to read it all and skip the blanks.

Add another question for that.

OscarRyz
yes, it's like creating a tree. I know, it's maybe the only way to do it, but there would be a ;ot of files...
Dmitry
Not really... I mean you don't need to necessarily create 1 gb of files. You just do it until you can perform in memory sort.
OscarRyz
+1 if for no other reason than the first effective use I have *ever* seen of translucent windows. Kudos. Also you put a lot of work into this good answer.
Carl Manaster
This is a darn fine answer, nice work
stimms
Why on Earth would you use an on-disk quicksort? Am I missing something here?My take would be to read in memory-sized chunks, sort in RAM, write to temp files. Once you've got the file processed and sorted in chunks merge-sort the chunks into the output file.
JUST MY correct OPINION
Dmitry
@ttmrichter: Yes, actually that's another possibility. I would like to know how to do that. If is in your possibilities, could you post the working code to do that? Utaal, suggested something similar too, but I didn't knew how to implement it. I did this just to know if what I was talking about was feasible and/or possible.
OscarRyz
@Dmitry: Erhm.. what? That line it is supposed to be to compare if the file length goes beyond your "in memory" capabilities" If it does, you split it in two and do it again. What remaining part are you talking about ( I don't quite get you ) Probably you're referring to what happens when is below MAX_SIZE? If that case you use "In memory sort". I set here MAX_SIZE to aprox. 16 mb, in production you could set up to 1 gb or so, depending on who many RAM do you have.
OscarRyz
@Dmtry: BTW, this is not a "copy/paste and put in your production code" post. This is more like a sample I wanted to do of what I was suggesting. There might be bugs in the code ( for instance, I'm dropping the pivot each time from the sorting :P ) but that was intentional as I was trying to avoid too many validations in a single piece of code.
OscarRyz
yes, I understood =)
Dmitry
@Oscar:I wrote another algorithn which is base on your idea. But both these algorithms(yours and mine) dont' work peroperly - they give sorted blocks of numbers. For example: 1 2 3 200 201 202 50 52 52I'll post mine algorithm soon
Dmitry
@Dmitry: Really? It shouldn't. Mmhh probably the problem is, the output always get appended to the `output` file ( I didn't create a new one or anything). So if you run it 3 times with the input `5,1,3` you'll have `1,3,5,1,3,5,1,3,5` in the output file. BTW to avoid dropping the pivot you just have to add: `greater.printf("%d,",pivot);` after the split.
OscarRyz
Here's a sample input: http://pastebin.com/D5mRY0wD
OscarRyz
And the output for that input: http://pastebin.com/qVd6TPCN
OscarRyz
@Oscar: http://en.wikipedia.org/wiki/External_sorting has a good overview of doing sorts on disk. The very first one there -- external merge sort -- is the one I was talking about.
JUST MY correct OPINION
@ttmrichter: Great, I'll take a look at that
OscarRyz
A: 

moving this to the original question

Dmitry
A: 

@Oscar you add in scanner, which stands for buffer, some integers, work with them, but what about the remaining? I mean inputFile.length() - MAX_SIZE??? It seems to me that you foget about them...

Dmitry
Those are covered in the "else" part ( isn't? ) If is not, you have a golden opportunity to fix it and understand it in the process. It was not my intention to do your work ( because I don't know your constraints ) but just provide a working sample of what was I saying.
OscarRyz
Yes, of course, thank you =))
Dmitry