I have a program that does a block nested loop join (link text). Basically, it reads the contents of a file (say a 10 GB file) into buffer1 (say 400 MB) and puts them into a hash table. It then reads the contents of a second file (say another 10 GB file) into buffer2 (say 100 MB) and checks whether the elements in buffer2 are present in the hash table. Outputting the result doesn't matter; I'm only concerned with the efficiency of the program for now. I need to read 8 bytes at a time from both files, so I use long long int. The problem is that my program is very inefficient. How can I make it more efficient?

// I compile using g++ -o hash hash.c -std=c++0x

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdint.h>
#include <math.h>
#include <limits.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_map>
using namespace std;

typedef std::unordered_map<unsigned long long int, unsigned long long int> Mymap; 
int main()
{

uint64_t block_size1 = (400*1024*1024)/sizeof(long long int);  // number of 8-byte elements in the 400 MB buffer for table A (see the malloc calls below)
uint64_t block_size2 = (100*1024*1024)/sizeof(long long int);  // number of 8-byte elements in the 100 MB buffer for table B

int i=0,j=0, k=0;
uint64_t x,z,l=0;
unsigned long long int *buffer1 = (unsigned long long int *)malloc(block_size1 * sizeof(long long int));
unsigned long long int *buffer2 = (unsigned long long int *)malloc(block_size2 * sizeof(long long int));

Mymap c1 ;                                                          // Hash table
//Mymap::iterator it;

FILE *file1 = fopen64("10G1.bin","rb");  // Input is a binary file of 10 GB
FILE *file2 = fopen64("10G2.bin","rb");

printf("size of buffer1 : %llu \n", block_size1 * sizeof(long long int));
printf("size of buffer2 : %llu \n", block_size2 * sizeof(long long int));


size_t read1, read2;                                                                 // number of 8-byte elements actually read
while((read1 = fread(buffer1, sizeof(long long int), block_size1, file1)) > 0)      // Read a block from the first file; checking feof() alone would process the last block twice
        {
        k++;
        printf("Iterations completed : %d \n",k);

        for ( x=0;x< read1;x++)
            c1.insert(Mymap::value_type(buffer1[x], x));                                    // inserting values into the hash table

//      std::cout << "The size of the hash table is" << c1.size() * sizeof(Mymap::value_type) << "\n" << endl;

/*      // display contents of the hash table 
            for (Mymap::const_iterator it = c1.begin();it != c1.end(); ++it) 
            std::cout << " [" << it->first << ", " << it->second << "]"; 
            std::cout << std::endl; 
*/

                while((read2 = fread(buffer2, sizeof(long long int), block_size2, file2)) > 0)   // Read a block from the second file
                {   
                    i++;                                                                    // Counting the number of iterations    
//                  printf("%d\n",i);

                    for ( z=0;z< read2;z++)
                        c1.find(buffer2[z]);                                                // probing the hash table

//                      if(c1.find(buffer2[z]) != c1.end())                                 // To check the correctness of the code
//                          l++;
//                  printf("The number of elements equal are : %llu\n",l);                  // If the input files have exactly the same contents, "l" should equal read2
//                  l=0;                    
                }
                rewind(file2);
                c1.clear();                                         // clear the contents of the hash table
    }

    free(buffer1);
    free(buffer2);  
    fclose(file1);
    fclose(file2);
}

Update:

Is it possible to read a chunk (say 400 MB) from a file and put it directly into a hash table using C++ stream readers? I think that could further reduce the overhead.

A: 

The only way to know is to profile it, e.g. with gprof. Create a benchmark of your current implementation, then experiment with other modifications methodically and re-run the benchmark.
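For example, a minimal timing harness for such a benchmark might look like this (just a sketch using gettimeofday(), which the question's code already includes; what you wrap with it is up to you):

#include <cstdio>
#include <sys/time.h>

// Wall-clock time in seconds, good enough for before/after comparisons.
static double now_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    double t0 = now_seconds();
    // ... run the join (or just the part being measured) here ...
    double t1 = now_seconds();
    printf("elapsed: %.3f s\n", t1 - t0);
    return 0;
}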

the_mandrill
+2  A: 

The running time for your program is O(l1 × bs1 × l2 × bs2) (where l1 is the number of lines in the first file, bs1 is the block size for the first buffer, l2 is the number of lines in the second file, and bs2 is the block size for the second buffer), since you have four nested loops. Since your block sizes are constant, you can say that your order is O(n × 400 × m × 400) or O(160000mn), or in the worst case O(160000n²), which essentially ends up being O(n²).

You can have an O(n) algorithm if you do something like this (pseudocode follows):

map = new Map();
duplicate = new List();
unique = new List();

for each line in file1
   map.put(line, true)
end for

for each line in file2
   if(map.get(line))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

Now duplicate will contain a list of duplicate items and unique will contain a list of unique items.
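A concrete C++ version of the same idea might look like this (a sketch assuming the 8-byte keys from the question rather than text lines, and assuming the build side fits in memory; the names are illustrative):

#include <cstddef>
#include <stdint.h>
#include <unordered_set>
#include <vector>

// Build a hash set from file 1's keys, then stream file 2's keys through it once.
// The duplicate/unique vectors correspond to the pseudocode's lists.
void classify(const std::vector<uint64_t> &file1_keys,
              const std::vector<uint64_t> &file2_keys,
              std::vector<uint64_t> &duplicate,
              std::vector<uint64_t> &unique)
{
    std::unordered_set<uint64_t> seen(file1_keys.begin(), file1_keys.end());
    for (std::size_t i = 0; i < file2_keys.size(); ++i) {
        if (seen.count(file2_keys[i]))
            duplicate.push_back(file2_keys[i]);
        else
            unique.push_back(file2_keys[i]);
    }
}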

In your original algorithm, you needlessly traverse the second file for every block of the first file, so you actually end up losing the benefit of the hash (which gives you O(1) lookup time). The trade-off in this case, of course, is that you have to store the entire 10 GB in memory, which is probably not feasible. Usually in cases like these the trade-off is between run time and memory.

There is probably a better way to do this. I need to think about it some more. If not, I'm pretty sure someone will come up with a better idea :).

UPDATE

You can probably reduce memory usage if you can find a good way to hash the line (that you read in from the first file) so that you get a unique value (i.e., a 1-to-1 mapping between the line and the hash value). Essentially you would do something like this:

for each line in file1
   map.put(hash(line), true)
end for

for each line in file2
   if(map.get(hash(line)))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

Here hash is the function that computes the hash value. This way you don't have to store all the lines in memory; you only have to store their hashed values. This might help you a little bit. Even so, in the worst case (where you are comparing two files that are either identical or entirely different) you can still end up with 10 GB in memory for either the duplicate or the unique list. You can get around this, with the loss of some information, by simply storing a count of unique or duplicate items instead of the items themselves.
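A sketch of that idea for text lines, using std::hash and storing only counts (illustrative; note that std::hash is not a true 1-to-1 mapping, so collisions can misclassify the odd line):

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>

int main()
{
    std::unordered_set<std::size_t> hashes;          // hashed lines from file 1, not the lines themselves
    std::hash<std::string> hasher;

    std::ifstream f1("file1.txt"), f2("file2.txt");  // file names are illustrative
    std::string line;

    while (std::getline(f1, line))
        hashes.insert(hasher(line));

    std::size_t duplicates = 0, uniques = 0;         // counts only, to bound memory use
    while (std::getline(f2, line)) {
        if (hashes.count(hasher(line)))
            ++duplicates;
        else
            ++uniques;
    }
    printf("duplicate: %zu, unique: %zu\n", duplicates, uniques);
    return 0;
}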

Vivin Paliath
I get your point, but it seems very memory-inefficient.
Sunil
@Sunil yup, it is (unless you store the hashed values, in which case you can reduce memory costs). As I mentioned, that's usually the trade-off. Speed vs. memory. In your solution you use very little memory at the expense of speed. In my (original) solution my runtime is low but with higher memory usage. For large datasets nested loops usually have a very high runtime.
Vivin Paliath
+1  A: 

long long int *ptr = mmap() your files, then compare them with memcmp() in chunks. Once a discrepancy is found, step back one chunk and compare them in more detail. (More detail means long long int in this case.)

If you expect to find discrepancies often, do not bother with memcmp(); just write your own loop comparing the long long ints to each other.
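A rough sketch of that approach (error handling omitted; the file names and chunk size are illustrative, and it assumes both files are the same size):

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd1 = open("10G1.bin", O_RDONLY);
    int fd2 = open("10G2.bin", O_RDONLY);

    struct stat st;
    fstat(fd1, &st);
    size_t size = st.st_size;                          // assumes both files are this size

    char *p1 = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd1, 0);
    char *p2 = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd2, 0);

    const size_t chunk = 4 * 1024 * 1024;              // illustrative chunk size
    for (size_t off = 0; off < size; off += chunk) {
        size_t n = (size - off < chunk) ? size - off : chunk;
        if (memcmp(p1 + off, p2 + off, n) != 0) {
            // Discrepancy somewhere in this chunk: re-compare it long long by long long.
            const long long *a = (const long long *)(p1 + off);
            const long long *b = (const long long *)(p2 + off);
            for (size_t k = 0; k < n / sizeof(long long); k++)
                if (a[k] != b[k])
                    printf("difference at byte offset %zu\n", off + k * sizeof(long long));
        }
    }

    munmap(p1, size);
    munmap(p2, size);
    close(fd1);
    close(fd2);
    return 0;
}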

Amigable Clark Kant
A: 

I'd bet you'd get better performance if you read in larger chunks: fread() a bigger buffer and process multiple blocks per pass.

Jay
Of course, but I want to read only 8 bytes at a time. Wouldn't it be faster if I used ifstream instead of fread()? The main point I'm trying to make is that my read and map operations are very slow, and I would appreciate suggestions to improve on that. Thanks
Sunil
If you call fread fewer times, you remove the per-call overhead of setting up and tearing down. Since you're doing that a LOT of times, it will have a significant impact. 10 GB / 8 bytes = the overhead of 1.25 billion calls removed.
Jay
A: 

The problem I see is that you are reading the second file n times. Really slow.

The best way to make this faster is to pre-sort the files and then do a sort-merge join. The sort is almost always worth it, in my experience.
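For illustration, the merge step of a sort-merge join over two already-sorted key arrays looks roughly like this (an in-memory sketch; with pre-sorted files you would stream sorted blocks through the same two-pointer loop):

#include <cstddef>
#include <cstdio>
#include <stdint.h>
#include <vector>

// Emit matching keys from two sorted inputs in a single pass
// (keys are treated as unique for simplicity).
void merge_join(const std::vector<uint64_t> &a, const std::vector<uint64_t> &b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            printf("match: %llu\n", (unsigned long long)a[i]);
            i++;
            j++;
        }
    }
}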

Jeff Walker
I know, but that is the whole point of the Block Nested Loop Join algorithm.
Sunil
I guess what I'm saying is not to use a Block Nested Loop join, unless you can't do it any other way. The Nested Loop join is a last-resort type of algorithm. I know nothing about your data, but there is usually a way to sort the data, so that you can use a more reasonable join algorithm.
Jeff Walker
@jeff: I see what you are talking about. The problem is not to find another efficient algorithm but to use Block Nested Loop Join and to code this program correctly so that it works efficiently.
Sunil
+3  A: 

If you're using fread, then try using setvbuf(). The default buffers used by the standard library file I/O calls are tiny (often on the order of 4 KB). When processing large amounts of data quickly, you will be I/O bound, and the overhead of fetching many small buffer-fuls of data can become a significant bottleneck. Set this to a larger size (e.g. 64 KB or 256 KB) and you can reduce that overhead and may see significant improvements; try out a few values to see where you get the best gains, as you will get diminishing returns.
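For example, a minimal sketch of enlarging the stdio buffer (the 256 KB figure is just a starting point to experiment with):

#include <cstdio>

int main()
{
    FILE *file1 = fopen("10G1.bin", "rb");
    if (!file1)
        return 1;

    // Replace the small default stdio buffer with a 256 KB one.
    // setvbuf() must be called after fopen() and before the first read.
    static char iobuf[256 * 1024];
    setvbuf(file1, iobuf, _IOFBF, sizeof(iobuf));

    // ... fread() loop as before ...

    fclose(file1);
    return 0;
}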

Jason Williams
Seems interesting. Will try and post back the results.
Sunil