I've been tasked with reconciling two big data sets (two big lists of transactions). Basically I extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts at achieving this (click if interested).

The gist of it is to load both data sets into a big hash table, with the keys being the rows, and the values incremented by 1 each time a row appears in file A and decremented by 1 each time it appears in file B. Then at the end, I look for any key/value pairs where the value != 0.
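
For reference, a minimal sketch of that approach in C# (illustrative only; the file names are placeholders, and the field extraction and report formatting are omitted):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Reconciler
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();

        Tally("A.txt", counts, +1);   // +1 for each appearance in file A
        Tally("B.txt", counts, -1);   // -1 for each appearance in file B

        foreach (var kv in counts)
            if (kv.Value != 0)        // non-zero: the row is in one file but not the other
                Console.WriteLine(kv.Value + "\t" + kv.Key);
    }

    static void Tally(string path, Dictionary<string, int> counts, int delta)
    {
        foreach (string row in File.ReadLines(path))
        {
            int n;
            counts.TryGetValue(row, out n);
            counts[row] = n + delta;
        }
    }
}
```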

My algorithm seems fast enough (10 seconds for two 100 MB files), however it's a bit memory-intensive: 280 MB to compare two 100 MB files. I would hope to get it down to 100 MB peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.

Any ideas?

Also, let me know if this is too open-ended for SO.

+1  A: 

The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time, it would reduce your memory footprint but increase your disk I/O, which would probably result in a longer processing time.

mezoid
I'm prepared to sacrifice some processing time for less memory use. Any idea how you would avoid loading all the data of at least one dataset into memory at once? This may be difficult, because the two datasets aren't in the same order.
Chris
Don't know if it is important or not, but the memory figure you are quoting, is it for a large mismatch (very little overlap in the data, so the report is nearly of size A.txt + B.txt) or for a large match? If it is for a large match, and you don't care about matching data in the report, you can try deleting the matched items as soon as you find a match. You can try deleting them in your in-memory data structure. Also, you might try removing the matched items from the file.
There is a very low mismatch, e.g. maybe a few dozen records that don't match up. So are you suggesting removing the matching keys from the hash table as I load the second file? I think I've tried that, but the memory usage didn't drop much. However, that may be because the GC wasn't collecting very often.
Chris
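
One way the delete-on-match suggestion above might look (an illustrative sketch, not the code from the blog; only file A is held in memory, and fully matched rows are discarded as file B streams through):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class DeleteOnMatch
{
    static void Main()
    {
        // Only file A is held in memory, as row -> occurrence count.
        var leftovers = new Dictionary<string, int>();
        foreach (string row in File.ReadLines("A.txt"))
        {
            int n;
            leftovers.TryGetValue(row, out n);
            leftovers[row] = n + 1;
        }

        // Stream file B and discard matched entries immediately.
        var onlyInB = new List<string>();
        foreach (string row in File.ReadLines("B.txt"))
        {
            int n;
            if (leftovers.TryGetValue(row, out n))
            {
                if (n == 1) leftovers.Remove(row);   // fully matched: free the entry
                else leftovers[row] = n - 1;
            }
            else
            {
                onlyInB.Add(row);                    // no counterpart left in A
            }
        }

        foreach (string row in leftovers.Keys) Console.WriteLine("In A only:\t" + row);
        foreach (string row in onlyInB) Console.WriteLine("In B only:\t" + row);
    }
}
```

With a low mismatch rate, the dictionary shrinks as file B is processed, so peak memory is roughly the rows of file A rather than both files.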
+1  A: 

One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.

Another option may be to use some kind of external program to sort the rows -- then you can do a simple in-order scan of the two files, looking for differences.

Back to your question, though: 280 MB sounds high for comparing a pair of 100 MB files -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you'll need to have the full contents of both in memory at once.

Jonathan
I like the idea of using another program to sort the rows beforehand. That would certainly simplify the comparison process. But can you think of a way to sort the rows that would use less memory itself?
Chris
I've tested using the DOS sort command: it takes 10 seconds for each file and uses 225 MB of memory, so the memory usage is pretty similar to my solution anyway. But I think you're onto something.
Chris
The nice thing about a merge sort is that, in theory, you can pick a memory limit via a two-pass system -- you have it read X lines, sort them, and write them to file1; repeat for the rest of the file. Then you read back in all the files simultaneously, writing out the lowest line and advancing only that file. Heck, if you set up an IEnumerable<string> implementation to read from the files, and a MergeSort that takes N IEnumerable<string>s, it should be relatively simple to get it to work.
Jonathan
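
A rough sketch of that two-pass external merge sort (illustrative only; the chunk size, file names, and ordinal string ordering are assumptions, and error handling is omitted):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    static void Main()
    {
        // Pass 1: split the input into sorted chunks; pass 2: merge them back together.
        List<string> chunks = SplitIntoSortedChunks("A.txt", 100000);  // chunk size is a guess
        File.WriteAllLines("A.sorted.txt", Merge(chunks));
        foreach (string path in chunks) File.Delete(path);
    }

    // Read at most chunkSize lines at a time, sort them in memory, write each chunk to a temp file.
    static List<string> SplitIntoSortedChunks(string inputPath, int chunkSize)
    {
        var chunkPaths = new List<string>();
        var chunk = new List<string>(chunkSize);

        foreach (string line in File.ReadLines(inputPath))
        {
            chunk.Add(line);
            if (chunk.Count == chunkSize)
            {
                chunkPaths.Add(WriteSortedChunk(chunk));
                chunk.Clear();
            }
        }
        if (chunk.Count > 0) chunkPaths.Add(WriteSortedChunk(chunk));
        return chunkPaths;
    }

    static string WriteSortedChunk(List<string> chunk)
    {
        chunk.Sort(StringComparer.Ordinal);
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, chunk);
        return path;
    }

    // K-way merge: repeatedly emit the smallest head line and advance only that chunk.
    static IEnumerable<string> Merge(List<string> chunkPaths)
    {
        var readers = chunkPaths.Select(p => File.ReadLines(p).GetEnumerator()).ToList();
        var heads = readers.Select(r => r.MoveNext() ? r.Current : null).ToList();

        while (true)
        {
            int best = -1;
            for (int i = 0; i < heads.Count; i++)
                if (heads[i] != null &&
                    (best < 0 || string.CompareOrdinal(heads[i], heads[best]) < 0))
                    best = i;

            if (best < 0) yield break;                 // every chunk is exhausted

            yield return heads[best];
            if (readers[best].MoveNext())
                heads[best] = readers[best].Current;
            else
            {
                heads[best] = null;
                readers[best].Dispose();               // done with this chunk's reader
            }
        }
    }
}
```

Peak memory is then bounded by the chunk size rather than by the size of the whole file.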
And now I see Chris's comment on Kevin's answer -- apparently that is called "external sorting". Makes sense.
Jonathan
Yup, you just described an 'external mergesort'. Hey, no harm in two people coming up with similar solutions; it's a good indicator that it's a good solution, if anything.
Chris
Just made an implementation of that mergesort: http://splinter.com.au/blog/?p=142
Chris
+2  A: 

I have done something similar to this, only in scripts on Unix using shell and Perl; however, the theory may carry over.

Step 1: sort both files so they are in order by the same criteria. I used the Unix sort command to do this (I required the unique flag, but you just need some sort of memory-efficient file sort). This is likely the tricky part to figure out on your own.

Step 2: open both files and essentially scan them line by line (or record by record if it's a binary format). If the line in the left file is equal to the one in the right file, then the lines match and you move on (remember we already sorted the files, so the smallest record should come first).

If the left record is greater than the right record, your right record is missing: add it to your list, read the next line from the right file, and simply do your check again. The same applies if your right record is greater: then your left record is missing, so report it and keep going.

Scanning the records should be very memory-efficient. It may not be as fast, but I was able to crunch several gigs of data with multiple passes, looking at different fields, within a couple of minutes.

Kevin Nisbet
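
For reference, a sketch of the sorted scan described in this answer (illustrative; it assumes both files have already been sorted with the same ordinal ordering):

```csharp
using System;
using System.IO;

class SortedCompare
{
    static void Main()
    {
        // Both inputs must already be sorted with the same ordinal ordering.
        using (var left = new StreamReader("A.sorted.txt"))
        using (var right = new StreamReader("B.sorted.txt"))
        {
            string l = left.ReadLine();
            string r = right.ReadLine();

            while (l != null || r != null)
            {
                // Treat a finished file as "infinitely large" so the other side drains.
                int cmp = l == null ? 1 : r == null ? -1 : string.CompareOrdinal(l, r);

                if (cmp == 0)                     // records match: advance both sides
                {
                    l = left.ReadLine();
                    r = right.ReadLine();
                }
                else if (cmp < 0)                 // left record has no counterpart in B
                {
                    Console.WriteLine("In A only:\t" + l);
                    l = left.ReadLine();
                }
                else                              // right record has no counterpart in A
                {
                    Console.WriteLine("In B only:\t" + r);
                    r = right.ReadLine();
                }
            }
        }
    }
}
```

Only one line from each file is held at a time, so memory use stays flat regardless of file size.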
Good plan. I think you're right; this would scale beautifully, except for the sort command, which is probably the tricky part.
Chris
Yeah, I'd look up how the Unix sort works, since it uses temporary files and doesn't seem too bad, efficiency-wise, for sorting text files. You might be able to do the sort as part of your field extraction process.
Kevin Nisbet
Just found this, which is the name for methods used to sort datasets bigger than RAM: http://en.wikipedia.org/wiki/External_sorting. So that, combined with your method, sounds like it'd produce a highly scalable comparison/reconciliation method.
Chris
Okay, I've just implemented that external merge sort in such a way that it doesn't use much RAM; it can be seen here: http://splinter.com.au/blog/?p=142
Chris
Hey, glad you solved it. External sorting, very interesting; glad you sent the link. I think the reason you may have found the split to be the slowest operation is that your read of the original file is likely from disk, but when you read the split files back in, Windows may still have a copy in RAM, which means you're only writing to disk, not reading and writing.
Kevin Nisbet
A: 

Using this method you would have to have the contents of one of the files in memory at all times, though. It would be more efficient, as far as memory goes, to simply read half of the file in, compare it line by line against the second file, then take the second half into memory and do the same. This overlapping would ensure that no records are missed, and would eliminate the need for the entire file to be stored in memory temporarily.

Steve
In addition, if you need to use less memory you can simply compare the first third of one file against the other file, then the second third, and so on. Obviously there will be a larger sacrifice in program speed in exchange for the memory saved.
Steve
This sounds similar to Kevin's answer, but he only said to read in a single line at a time instead of half the file. Combined with an external sort, this is the way to reconcile big datasets.
Chris
The way that he suggests doing it will not work for large data files. His method is to sort both files externally (like mine), then open both files (unlike mine) and compare them line by line, and the computer will not have enough memory to do this. I suggest opening half of one file and then comparing that line by line to the second file, so only half of one file (or less) is in use at a time. Kevin's approach has both files open at the same time (at least 4 times the amount of memory).
Steve