I have a program that does a block nested loop join (link text). Basically, it reads the contents of a file (say a 10 GB file) into buffer1 (say 400 MB) and puts them into a hash table. It then reads the contents of a second file (say another 10 GB file) into buffer2 (say 100 MB) and checks whether the elements in buffer2 are present in the hash table. Outputting the result doesn't matter; I'm only concerned with the efficiency of the program for now. I need to read 8 bytes at a time from both files, so I use long long int. The problem is that my program is very inefficient. How can I make it more efficient?

// I compile using g++ -o hash hash.c -std=c++0x

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdint.h>
#include <math.h>
#include <limits.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_map>
using namespace std;

typedef std::unordered_map<unsigned long long int, unsigned long long int> Mymap; 
int main()
{

uint64_t block_size1 = (400*1024*1024)/sizeof(long long int);  // number of 8-byte elements in the 400 MB buffer for table A (see the malloc calls below)
uint64_t block_size2 = (100*1024*1024)/sizeof(long long int);  // number of 8-byte elements in the 100 MB buffer for table B

int i=0,j=0, k=0;
uint64_t x,z,l=0;
unsigned long long int *buffer1 = (unsigned long long int *)malloc(block_size1 * sizeof(long long int));
unsigned long long int *buffer2 = (unsigned long long int *)malloc(block_size2 * sizeof(long long int));

Mymap c1 ;                                                          // Hash table
//Mymap::iterator it;

FILE *file1 = fopen64("10G1.bin","rb");  // Input is a binary file of 10 GB
FILE *file2 = fopen64("10G2.bin","rb");

printf("size of buffer1 : %llu \n", block_size1 * sizeof(long long int));
printf("size of buffer2 : %llu \n", block_size2 * sizeof(long long int));


size_t read1, read2;                                                                 // number of 8-byte elements actually read
while((read1 = fread(buffer1, sizeof(long long int), block_size1, file1)) > 0)      // Read a block from the first file; checking feof() alone would process the last block twice
        {
        k++;
        printf("Iterations completed : %d \n",k);

        for ( x=0;x< read1;x++)
            c1.insert(Mymap::value_type(buffer1[x], x));                                    // inserting values into the hash table

//      std::cout << "The size of the hash table is" << c1.size() * sizeof(Mymap::value_type) << "\n" << endl;

/*      // display contents of the hash table 
            for (Mymap::const_iterator it = c1.begin();it != c1.end(); ++it) 
            std::cout << " [" << it->first << ", " << it->second << "]"; 
            std::cout << std::endl; 
*/

                while((read2 = fread(buffer2, sizeof(long long int), block_size2, file2)) > 0)   // Read a block from the second file
                {   
                    i++;                                                                    // Counting the number of iterations    
//                  printf("%d\n",i);

                    for ( z=0;z< read2;z++)
                        c1.find(buffer2[z]);                                                // probing the hash table

//                      if(c1.find(buffer2[z]) != c1.end())                                 // To check the correctness of the code
//                          l++;
//                  printf("The number of elements equal are : %llu\n",l);                  // If the input files have exactly the same contents, "l" should equal read2
//                  l=0;                    
                }
                rewind(file2);
                c1.clear();                                         // clear the contents of the hash table
    }

    free(buffer1);
    free(buffer2);  
    fclose(file1);
    fclose(file2);
}

Update:

Is it possible to read a chunk (say 400 MB) from a file and put it directly into a hash table using C++ stream readers? I think that could further reduce the overhead.

A: 

The only way to know is to profile it, e.g. with gprof. Create a benchmark of your current implementation, then experiment with other modifications methodically and re-run the benchmark.
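For example, a minimal timing harness for such a benchmark might look like this (just a sketch using gettimeofday(), which the question's code already includes; what you wrap with it is up to you):

#include <cstdio>
#include <sys/time.h>

// Wall-clock time in seconds, good enough for before/after comparisons.
static double now_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    double t0 = now_seconds();
    // ... run the join (or just the part being measured) here ...
    double t1 = now_seconds();
    printf("elapsed: %.3f s\n", t1 - t0);
    return 0;
}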

the_mandrill
+2  A: 

The running time for your program is O(l1 × bs1 × l2 × bs2) (where l1 is the number of lines in the first file, bs1 is the block size for the first buffer, l2 is the number of lines in the second file, and bs2 is the block size for the second buffer), since you have four nested loops. Since your block sizes are constant, you can say that your order is O(n × 400 × m × 400) or O(160000mn), or in the worst case O(160000n²), which essentially ends up being O(n²).

You can have an O(n) algorithm if you do something like this (pseudocode follows):

map = new Map();
duplicate = new List();
unique = new List();

for each line in file1
   map.put(line, true)
end for

for each line in file2
   if(map.get(line))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

Now duplicate will contain a list of duplicate items and unique will contain a list of unique items.
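A concrete C++ version of the same idea might look like this (a sketch assuming the 8-byte keys from the question rather than text lines, and assuming the build side fits in memory; the names are illustrative):

#include <cstddef>
#include <stdint.h>
#include <unordered_set>
#include <vector>

// Build a hash set from file 1's keys, then stream file 2's keys through it once.
// The duplicate/unique vectors correspond to the pseudocode's lists.
void classify(const std::vector<uint64_t> &file1_keys,
              const std::vector<uint64_t> &file2_keys,
              std::vector<uint64_t> &duplicate,
              std::vector<uint64_t> &unique)
{
    std::unordered_set<uint64_t> seen(file1_keys.begin(), file1_keys.end());
    for (std::size_t i = 0; i < file2_keys.size(); ++i) {
        if (seen.count(file2_keys[i]))
            duplicate.push_back(file2_keys[i]);
        else
            unique.push_back(file2_keys[i]);
    }
}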

In your original algorithm, you needlessly traverse the second file for every block of the first file, so you actually end up losing the benefit of the hash (which gives you O(1) lookup time). The trade-off in this case, of course, is that you have to store the entire 10 GB in memory, which is probably not feasible. Usually in cases like these the trade-off is between run time and memory.

There is probably a better way to do this. I need to think about it some more. If not, I'm pretty sure someone will come up with a better idea :).

UPDATE

You can probably reduce memory usage if you can find a good way to hash the line (that you read in from the first file) so that you get a unique value (i.e., a 1-to-1 mapping between the line and the hash value). Essentially you would do something like this:

for each line in file1
   map.put(hash(line), true)
end for

for each line in file2
   if(map.get(hash(line)))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

Here hash is the function that computes the hash value. This way you don't have to store all the lines in memory; you only have to store their hashed values. This might help you a little bit. Even so, in the worst case (where you are comparing two files that are either identical or entirely different) you can still end up with 10 GB in memory for either the duplicate or the unique list. You can get around this, with the loss of some information, by simply storing a count of unique or duplicate items instead of the items themselves.
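A sketch of that idea for text lines, using std::hash and storing only counts (illustrative; note that std::hash is not a true 1-to-1 mapping, so collisions can misclassify the odd line):

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>

int main()
{
    std::unordered_set<std::size_t> hashes;          // hashed lines from file 1, not the lines themselves
    std::hash<std::string> hasher;

    std::ifstream f1("file1.txt"), f2("file2.txt");  // file names are illustrative
    std::string line;

    while (std::getline(f1, line))
        hashes.insert(hasher(line));

    std::size_t duplicates = 0, uniques = 0;         // counts only, to bound memory use
    while (std::getline(f2, line)) {
        if (hashes.count(hasher(line)))
            ++duplicates;
        else
            ++uniques;
    }
    printf("duplicate: %zu, unique: %zu\n", duplicates, uniques);
    return 0;
}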

Vivin Paliath
I get your point, but it seems very memory-inefficient.
Sunil
@Sunil yup, it is (unless you store the hashed values, in which case you can reduce memory costs). As I mentioned, that's usually the trade-off. Speed vs. memory. In your solution you use very little memory at the expense of speed. In my (original) solution my runtime is low but with higher memory usage. For large datasets nested loops usually have a very high runtime.
Vivin Paliath
+1  A: 

long long int *ptr = mmap() your files, then compare them with memcmp() in chunks. Once a discrepancy is found, step back one chunk and compare them in more detail. (More detail means long long int in this case.)

If you expect to find discrepancies often, do not bother with memcmp(); just write your own loop comparing the long long ints to each other.
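A rough sketch of that approach (error handling omitted; the file names and chunk size are illustrative, and it assumes both files are the same size):

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd1 = open("10G1.bin", O_RDONLY);
    int fd2 = open("10G2.bin", O_RDONLY);

    struct stat st;
    fstat(fd1, &st);
    size_t size = st.st_size;                          // assumes both files are this size

    char *p1 = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd1, 0);
    char *p2 = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd2, 0);

    const size_t chunk = 4 * 1024 * 1024;              // illustrative chunk size
    for (size_t off = 0; off < size; off += chunk) {
        size_t n = (size - off < chunk) ? size - off : chunk;
        if (memcmp(p1 + off, p2 + off, n) != 0) {
            // Discrepancy somewhere in this chunk: re-compare it long long by long long.
            const long long *a = (const long long *)(p1 + off);
            const long long *b = (const long long *)(p2 + off);
            for (size_t k = 0; k < n / sizeof(long long); k++)
                if (a[k] != b[k])
                    printf("difference at byte offset %zu\n", off + k * sizeof(long long));
        }
    }

    munmap(p1, size);
    munmap(p2, size);
    close(fd1);
    close(fd2);
    return 0;
}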

Amigable Clark Kant
A: 

I'd bet you'd get better performance if you read in larger chunks: fread() a bigger buffer and process multiple blocks per pass.

Jay
Of course, but I want to read only 8 bytes at a time. Wouldn't it be faster if I used ifstream instead of fread()? The main point I'm trying to make is that my read and map operations are very slow, and I would appreciate suggestions to improve on that. Thanks
Sunil
If you call fread fewer times, you remove the per-call overhead of setting up and tearing down. Since you're doing that a LOT of times, it will have a significant impact. 10 GB / 8 bytes = the overhead of 1.25 billion calls removed.
Jay
A: 

The problem I see is that you are reading the second file n times. Really slow.

The best way to make this faster is to pre-sort the files and then do a sort-merge join. The sort is almost always worth it, in my experience.
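For illustration, the merge step of a sort-merge join over two already-sorted key arrays looks roughly like this (an in-memory sketch; with pre-sorted files you would stream sorted blocks through the same two-pointer loop):

#include <cstddef>
#include <cstdio>
#include <stdint.h>
#include <vector>

// Emit matching keys from two sorted inputs in a single pass
// (keys are treated as unique for simplicity).
void merge_join(const std::vector<uint64_t> &a, const std::vector<uint64_t> &b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            printf("match: %llu\n", (unsigned long long)a[i]);
            i++;
            j++;
        }
    }
}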

Jeff Walker
I know, but that is the whole point of the Block Nested Loop Join algorithm.
Sunil
I guess what I'm saying is not to use a Block Nested Loop join, unless you can't do it any other way. The Nested Loop join is a last-resort type of algorithm. I know nothing about your data, but there is usually a way to sort the data, so that you can use a more reasonable join algorithm.
Jeff Walker
@jeff: I see what you are talking about. The problem is not to find another efficient algorithm but to use Block Nested Loop Join and to code this program correctly so that it works efficiently.
Sunil
+3  A: 

If you're using fread, then try using setvbuf(). The default buffers used by the standard library file I/O calls are tiny (often on the order of 4 KB). When processing large amounts of data quickly, you will be I/O bound, and the overhead of fetching many small buffer-fuls of data can become a significant bottleneck. Set this to a larger size (e.g. 64 KB or 256 KB) and you can reduce that overhead and may see significant improvements; try out a few values to see where you get the best gains, as you will get diminishing returns.
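For example, a minimal sketch of enlarging the stdio buffer (the 256 KB figure is just a starting point to experiment with):

#include <cstdio>

int main()
{
    FILE *file1 = fopen("10G1.bin", "rb");
    if (!file1)
        return 1;

    // Replace the small default stdio buffer with a 256 KB one.
    // setvbuf() must be called after fopen() and before the first read.
    static char iobuf[256 * 1024];
    setvbuf(file1, iobuf, _IOFBF, sizeof(iobuf));

    // ... fread() loop as before ...

    fclose(file1);
    return 0;
}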

Jason Williams
Seems interesting. Will try and post back the results.
Sunil