tags:

views:

86

answers:

3

I am working on an application wherein i need to compare 10^8 entries(alphanumeric entries). To retrieve the entries from file( file size is 1.5 GB) and then to compare them, i need to take less than 5 minutes of time. So, what would b the effective way to do that, since, only retrieving time is exceeding 5 min. And i need to work on file only. please suggest a way out. I m working on windows with 3GB RAM n 100Gb hard disk.

+5  A: 
  • Read a part of the file, sort it, write it to a temporary file.
  • Merge-sort the resulting files.
Sjoerd
+1 for a simple, straightforward, can't argue with that answer!
Will A
A: 

If retrieving time is exceeding 5 min it seems that you need to look at how you are reading this file. One thing that has caused bad performance for me is that a C implementation sometimes uses thread-safe I/O operations by default, and you can gain some speed by using thread-unsafe I/O.

What kind of computer will this be run on? Many computers nowadays have several gigabytes of memory, so perhaps it will work to just read it all into memory and then sort it there (with, for example, qsort)?

Thomas Padron-McCarthy
my computer is having 3gb RAM n 100gb hard disk. need to run on windows.
suvirai
+1  A: 

Error handling and header includes are not included. You need to provide DataType and cmpfunc, samples are provided. You should be able to deduce the core workings from this snippet:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

typedef char DataType; // is this alphanumeric?
int cmpfunc(char const *left, char const *right)
{
    return *right - *left;
}

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDWR|O_LARGEFILE);
    if (fd == -1)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    DataType *data = mmap(NULL, st.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (!data)
        return 1;
    qsort(data, st.st_size / sizeof(*data), cmpfunc);
    if (0 != msync(data, st.st_size, MS_SYNC))
        return 1;
    if (-1 == munmap(data, st.st_size))
        return 1;
    if (0 != close(fd))
        return 1;
    return 0;    
}

I can't imagine you can get much faster than this. Be sure you have enough virtual memory address space (1.5GB is pushing it but will probably just work on 32bit Linux, you'll be able to manage this on any 64bit OS). Note that this code is "limited" to working on a POSIX compliant system.

In terms of C and efficiency, this approach puts the entire operation in the hands of the OS, and the excellent qsort algorithm.

Matt Joiner
Note that this is **not** standard according to C -- it will only work on POSIX operating systems. Note also that it will only work if your data type is fixed in size.
Billy ONeal
Also, `qsort` is not efficient when the data to be sorted are located on secondary storage -- it is not an *external* sorting algorithm.
Billy ONeal
@Billy ONeal: Of course. But in light of the OP, and the excellence and ubiquity of POSIX systems, the code is both enlightening and effective.
Matt Joiner
@Billy ONeal: But it's fine when you leave it up to the OS and `mmap`'s effective caching (again this would be OS dependent).
Matt Joiner
@Matt: Over 90% of consumer operating system installations are not POSIX. Assuming POSIX when that is not specified is wrong for a C question.
Billy ONeal
@Matt: It doesn't matter how effective the caching is -- Quicksort requires fast access to the entire data set in order to perform the partition operation. If you need external sorting, merge sort typically performs better.
Billy ONeal
@Billy ONeal: I pity non-POSIX users. I pity non-POSIX programmers even more. ;)
Matt Joiner
@Matt: That's fine, but you need to at least specify that you are writing POSIX code here, not C, in your answer. And I pity POSIX's unfortunate choice of naming everything as indecipherable gibberish because they are too lazy to type out more descriptive names. ;)
Billy ONeal
@Matt: @Billy is somewhat correct in that this will only be efficient if the entire file fits into physical memory - otherwise, `qsort()` will thrash it to and from storage, which will be highly *inefficient* (so if you change it to from "virtual address space" to just "memory", you'll be correct).
caf
@Billy ONeal: The short names of functions like `mmap()` are because they date from a time when some linkers only supported 6 significant characters in external symbols. Newer POSIX functions are named in a more verbose fashion (for example, [`sched_get_priority_max()`](http://opengroup.org/onlinepubs/007908775/xsh/sched_get_priority_max.html))
caf
-1 because the combination of *mmap* and *qsort* will access the file in a rather random order. This for sure is much slower than reading the file sequentially. And retrieving the file is obviously an issue.
Codo