views: 178
answers: 2
Hi,

I'm trying to implement an I/O-intensive quicksort (C's qsort, called from C++) on a very large dataset. In the interest of speed, I'd like to read a chunk of data at a time into a buffer and then use qsort to sort it inside the buffer. (I am currently working with text files but would like to move to binary soon.) However, my data is composed of variable-length records, and qsort needs to be told the length of the record in order to sort. Is there any way to standardize this? The only thing I could think of is rather convoluted: my program currently reads from the buffer until it hits a linefeed character (ASCII 10), transferring each character over to another array. When it finds a linefeed (the delimiter in the input file), it pads the remaining space in the buffer for that record (record size is set to 30) with null characters. This way, I should end up with a buffer full of fixed-size records to give qsort.

I know there are several problems with my approach, one being that it's just clumsy, another that the record size might conceivably be larger than 30, but is generally much less. Is there a better way of doing this?

As well, my current code doesn't even work. When I debug it, it seems to be transferring characters from one buffer to the other, but when I try to print out the buffer, it contains only the first record.

Here is my code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

// NUM_INPUT_FILES, NUM_RECORDS and RECORD_SIZE (30) are defined elsewhere

FILE *fp;
unsigned char *buff;
unsigned char *realbuff;
FILE *inputFiles[NUM_INPUT_FILES];
buff = (unsigned char *) malloc(2048);
realbuff = (unsigned char *) malloc(NUM_RECORDS * RECORD_SIZE);

fp = fopen("postings0.txt", "r");
if(fp)
{
    fread(buff, 1, 2048, fp);

    /*for(int i=0; i < 30; i++)
        cout << buff[i] << endl;*/

    int y = 0;
    int recordcounter = 0;

    //cout << buff;
    for(int i = 0; i < 100; i++)
    {
        if(buff[i] != char(10))
        {
            realbuff[y] = buff[i];
            y++;
            recordcounter++;
        }
        else
        {
            if(recordcounter < RECORD_SIZE)
                for(int j = recordcounter; j < RECORD_SIZE; j++)
                {
                    realbuff[y] = char(0);
                    y++;
                }
            recordcounter = 0;
        }
    }

    cout << realbuff << endl;
    cout << buff;
}
else
    cout << "sorry";

Thank you very much, bsg

+1  A: 

The qsort function can only work on fixed-length records (as you say). To sort variable-length records, build an array of pointers to them and have qsort sort that array of pointers instead. This may be more efficient too, as pointers are much faster to move around than large chunks of data.

The same goes for std::sort, which would be the recommended approach because it is type-safe. Just be sure to supply a comparison predicate (a less-than function) taking pointers as its arguments, passed as the third argument.

Tronic
Thanks for the suggestion. I created an array of pointers and pointed each one to the beginning of a record, but because the records live in one array of characters, each pointer effectively points to the entire rest of the array. So when it sorts and I try to print the results, it prints the whole array several times over. How can I make each pointer point to only one record? Again, each one points to the beginning of a record, but it treats the rest of the buffer as part of the string it's pointing to as well. Thanks, bsg.
bsg
A: 

How about using C++ file streams for parsing your file?

Check out this example (website name is strange, no offense!), which returns the records as an STL vector; you can then use the STL sort algorithm.

RP