ansaurus

Question

Comparing records in file and reporting stats - Scenario 1

Answer 1

+2 A:

Tie::File is very slow. There are two reasons for this: First, tied variables are significantly slower than standard variables. The other reason is that in the case of Tie::File the data in your array is on disk rather than in memory. This greatly slows access. Tie::File's cache can help performance in some circumstances but not when you just loop over the array one element at a time as you do here. (The cache only helps if you revisit the same index.) The time to use Tie::File is when you have an algorithm that requires having all the data in memory at once but you don't have enough memory to do so. Since you're only processing the file one line at a time using Tie::File is not only pointless, it's harmful.

I don't think a trie is the right choice here. I'd use a plain HoH (hash of hashes) instead. Your files are small enough that you should be able to get everything in memory at once. I recommend parsing each file and building a hash that looks like this:

%data = (
  id1 => {
    field1 => value1,
    field2 => value2,
  },
  id2 => {
    field1 => value1,
    field2 => value2,
  },
);

If you use your mappings to normalize the field names while building the data structure it will make the comparison easier.

To compare the data, do this:

Perform a set comparison of the keys of the two hashes. This should generate three lists: The IDs present in just the legacy data, the IDs present in just the new data, and the IDs present in both.
Report the lists of IDs that only appear in one data set. These are records that don't have a corresponding record in the other data set.
For the IDs in both data sets, compare the data for each ID field by field and report any differences.

Michael Carman 2009-05-21 02:59:19

ansaurus

tags:

views:

answers:

Comparing records in file and reporting stats - Scenario 1

related questions