Is there a more efficient way to reconcile large data sets?

+1 A:

The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time it would reduce your memory foot print but increase your disk IO which would probably result in a longer processing time.

mezoid 2009-07-09 05:10:02

I'm prepared to sacrifice some processing time for less memory use. Any idea how you would not load all the data of at least one dataset in memory at once? This may be difficult, because the two datasets aren't in the same order.

Chris 2009-07-09 05:13:56

Don't know if it is important to know or not, but the memory that you are quoting, is it when there is a large mismatch (very few overlap of data, so the report is nearly of size A.txt + B.txt) or when there is a large match? if it is for a large match, and you dont care about matching data in the report, you can try deleting the matched items as soon as you find a match. You can try deleting in your in-memory data structure. Also, you might try removing the matched items from the file.

2009-07-09 05:34:46

There is a very low mismatch - eg maybe a few dozen records that don't match up.So are you suggesting to remove the matching keys from the hashtable as i load the second file? I think i've tried that but the memory usage didn't drop much. However, that may be because the GC wasn't collecting very often.

Chris 2009-07-09 05:43:14

+1 A:

One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.

Another option may be use some kind of external program to sort the rows -- then you can do a simple scan of the two files in-order looking for differences.

Back to your question though, 280mb sounds high for comparing a pair of 100mb files though -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you'll need to have the full contents of both in memory at once.

Jonathan 2009-07-09 05:34:22

I like the idea of using another program to sort the rows beforehand. That would certainly simplify the comparison process. But can you think of a way to sort the rows that would use less memory itself?

Chris 2009-07-09 05:44:51

I've tested using the DOS sort command: it takes 10seconds for each file, and uses 225MB of memory. So the memory usage is pretty similar to my solution anyway. But i think you're onto something.

Chris 2009-07-09 05:52:07

The nice thing about a merge sort is that in theory, you can pick a memory limit via a two-pass system -- you have it read X lines, sort them, and write them to file1, repeat for the rest of the file. Then, you read back in all the files simultaneously, writing out the lowest line and advancing only that file. Heck, if you set up an IEnumerable<string> implementation to read from the files, and a MergeSort that takes N IEnumerable<string>s, it should be relatively simple to get it to work.

Jonathan 2009-07-09 12:30:30

And now I see Chris's comment on Kevin's answer -- apparently that is called "external sorting". Makes sense.

Jonathan 2009-07-09 12:31:48

Yup, you just described an 'external mergesort'. Hey, no harm in two people coming up with similar solutions, its a good indicator that its a good solution if anything.

Chris 2009-07-09 23:05:51

Just made an implementation of that mergesort: http://splinter.com.au/blog/?p=142

Chris 2009-07-10 05:57:50

+2 A:

I have done something similar to this only in scripts on unix using shell and perl, however the theory may cary over.

Step 1, sort both files so they are in order by the same criteria. I used the unix sort command to do this (i required the unique flag, but you just need some sort of memory efficient file sort). This is likely the tricky part to figure out on you're own.

Step 2, open both files, and essentially scan them line by line (or record by record if binary format). If the line in the left file is equal to the one in the right file, then the lines match and move on (remember we already sorted the file, so the smallest record should be first).

If left record is greater than right record, you're right record is missing, add it to you're list, and read the next line on the right file. And simply do you're check again. Same thing applies if you're right record is greater, than you left record is missing, report it and keep going.

The scanning the records should be very memory efficient. It may not be as fast, but for me I was able to crunch several gigs of data with multiple passes looking at different fields witihn a couple minutes.

Kevin Nisbet 2009-07-09 05:49:14

Good plan. I think you're right, this would scale beautifully, except for the sort command which is probably the tricky part.

Chris 2009-07-09 05:53:19

Yea, i'd look up how the unix sort works, since it uses temporary files and doesn't seem too bad for efficiency for sorting text files. You might be able to do the sort as part of you're field extraction process.

Kevin Nisbet 2009-07-09 06:00:36

Just found this, which is the name for methods used to sort datasets bigger than ram:http://en.wikipedia.org/wiki/External_sortingSo that, combined with your method, sounds like it'd produce a highly scalable comparison/reconciliation method.

Chris 2009-07-09 06:10:19

Okay i've just implemented that external merge sort in such a way that it doesn't use much RAM, can be seen here: http://splinter.com.au/blog/?p=142

Chris 2009-07-10 05:56:48

Hey glad you solved it. External sorting, very interesting, glad you sent the link. I think the reason you may have found the split to be the slowest operation, is that likely you're read of the file is from disk, but when you read the split files windows may still have a copy in ram, that is being read back in which means you're only writing to disk, not reading and writing.

Kevin Nisbet 2009-07-10 07:06:51

A:

Using this method you would have to have the contents of one of the files in memory at all times though. It would be more efficient, as far as memory goes, to simply take half of the file in. Compare it line by line against the second file. Then take the second half to memory and do the same. This overlapping would ensure that there are no records missed. And would eliminate the need for the entire file to be stored temporarily.

Steve 2009-10-06 16:55:53

In addition, if you need to use less memory you can simply compare the first third of the file against the first, then the second third of the file against the first etc..Obviously there will be a larger sacrifice in program speed for increase in memory over time.

Steve 2009-10-06 17:05:45

This sounds similar to kevin's answer, but he only said to have a single line read in at a time instead of half the file. Combined with an external sort, this is the way to reconcile big datasets.

Chris 2009-10-07 03:38:13

The way that he suggests doing it will not work for large data files. His method is to sort both files externally (like mine), then open both files (unlike mine) and compare them line by line. And the computer will not have enough memory to do this. I suggest opening half of a file and then comparing that line by line to the second file. Therefore only half of one file (or less) is used at a time. Kevin's approach has both files open at the same time (at least 4 times the amount of memory)

Steve 2009-10-07 03:45:57

ansaurus

tags:

views:

answers:

Is there a more efficient way to reconcile large data sets?

related questions