Hello,

The situation I'm facing is as follows:

There is a large number of 'flat' files from which a C# app extracts data to create entries, which are in turn written to a database (MS SQL Server). A full release of the database comprises ~97 million entries across 220 GB.

The task is to create a differential update of the data in the database by parsing a new full release and finding out which of the existing entries have been updated. An entry is considered to be updated if any of its properties has been changed.
[UPDATE] Each entry has a unique ID.

The problem is that the data provider does not supply any indication of entry modification (a version number or a last modification date) - only full releases.

The solution I've come up with so far is to generate a hash sum for each entry and then compare the new one to the old one.
What makes hash sums undesirable, though, is the combination of the size of the data and the number of entries - it's just staggering.
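
To make the hash-sum idea concrete, here is a minimal sketch, written in Python for brevity (the same approach should port to C#). It assumes each entry can be read as an ID plus a dictionary of properties; the names `entry_digest` and `diff_release` and that record layout are hypothetical, not part of the actual app.

```python
import hashlib


def entry_digest(properties):
    """Return a SHA-256 digest of an entry's properties (hypothetical dict layout)."""
    h = hashlib.sha256()
    # Sort property names so the digest does not depend on dictionary order.
    for name in sorted(properties):
        for part in (name, str(properties[name])):
            data = part.encode("utf-8")
            # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
    return h.digest()


def diff_release(old_digests, new_entries):
    """Classify entries of a new release against stored digests.

    old_digests: {entry_id: digest} saved from the previous release.
    new_entries: iterable of (entry_id, properties) parsed from the new release.
    Returns (added, updated, removed) lists of entry IDs.
    """
    added, updated, seen = [], [], set()
    for entry_id, props in new_entries:
        seen.add(entry_id)
        old = old_digests.get(entry_id)
        if old is None:
            added.append(entry_id)
        elif old != entry_digest(props):
            updated.append(entry_id)
    removed = [i for i in old_digests if i not in seen]
    return added, updated, removed
```

Only the digests of the previous release need to be kept, not the old data itself. With a 32-byte digest and ~97 million entries that is roughly 3 GB of hash data, which is small next to the 220 GB release; whether that is acceptable here is of course an assumption on my part.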

So, is there a better solution than this?

Any help with the case will be greatly appreciated!

All the best, Borislav

A: 

Is there a key that you can use to uniquely identify a record?

If not, you can only find the records that are identical in both releases. You would then need to remove all existing records that have no match in the new release and add all records from the new release that do not match an existing record.

Having a key would make things much easier though.
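
The keyless fallback described above can be sketched roughly as follows, in Python for brevity. It assumes each record can be represented as a hashable tuple of its property values; `keyless_diff` is a hypothetical name, and for data of this size you would probably compare per-record digests rather than the full tuples.

```python
from collections import Counter


def keyless_diff(old_records, new_records):
    """Match identical records without a key and report what is left over.

    Records are hashable tuples of property values (hypothetical layout).
    Counters are used so that duplicate records are matched one-to-one.
    Returns (to_remove, to_add).
    """
    old_counts = Counter(old_records)
    new_counts = Counter(new_records)
    # Counter subtraction keeps only positive counts, i.e. unmatched records.
    to_remove = old_counts - new_counts
    to_add = new_counts - old_counts
    return list(to_remove.elements()), list(to_add.elements())
```

Note that without a key an updated record simply shows up as one removal plus one addition; there is no way to tell that the two belong together.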

Johann Blais
Yes, there is a unique ID for each entry - I've updated the question. Removing an entry purely for the reason that it exists is fine in terms of performance, but updated entries need to be found and marked as such - that's what's been puzzling me.
Borislav T