I'm looking for some design help here.

I'm doing work for a client that requires me to store data about their tens of thousands of employees. The data is being given to me in Excel spreadsheets, one for each city/country in which they have offices.

I have a database that contains a spreadsheets table and a data table. The data table has a column spreadsheet_id which links it back to the spreadsheets table so that I know which spreadsheet each data row came from. I also have a simple shell script which uploads the data to the database.
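The layout described above can be sketched roughly as follows, using SQLite for illustration. Only the table names and the `spreadsheet_id` link come from the description; every other column name is an assumption.

```python
import sqlite3

# Hypothetical sketch of the two-table layout described above.
# Only 'spreadsheets', 'data', and 'spreadsheet_id' come from the
# question; the remaining column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spreadsheets (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'active'   -- 'active' or 'inactive'
);

CREATE TABLE data (
    id             INTEGER PRIMARY KEY,
    spreadsheet_id INTEGER NOT NULL REFERENCES spreadsheets (id),
    employee_name  TEXT,
    office         TEXT
);
""")
```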

So far so good. However, there's some data missing from the original spreadsheets, and instead of giving me just the missing data, the client is giving me a modified version of the original spreadsheet with the new data appended to it. I cannot simply overwrite the original data since the data was already used and there are other tables that link to it.

The question is - how do I handle this? It seems to me that I have the following options:

  1. Upload the entire modified spreadsheet, and mark the original as 'inactive'.

    PROS: It's simple, straightforward, and easily automated.
    CONS: There's a lot of redundant data being stored in the database unnecessarily, especially if the spreadsheet changes numerous times.

  2. Do a diff on the spreadsheets and only upload the rows that changed.

    PROS: Less data gets loaded into the database.
    CONS: It's at least partially manual, and therefore error-prone. It also means the database will no longer tell the whole story: if some data turns out to be missing at a later date, I won't be able to say authoritatively, just by querying the database, that I never received it. And will diffing keep working if I have to repeat it several times?

  3. Write a process that compares each spreadsheet row with what's in the database, inserts the rows that have changed data, and sets the original data row to inactive. (I have to keep track of the original data also, so I can't overwrite it.)

    PROS: It's automated.
    CONS: It will take time to write and test such a process, and it will be very difficult for me to justify the time spent doing so.
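Option 3 could be sketched along these lines. This assumes a natural key from the spreadsheet (here called `employee_id`) plus a `status` flag on each data row; those names, and the whole schema, are illustrative, not the asker's actual one.

```python
import sqlite3

# Sketch of option 3 under an assumed schema: each data row carries an
# employee_id (the natural key from the sheet), the spreadsheet_id it came
# from, and a status flag. Changed rows are re-inserted under the new
# spreadsheet; the superseded row is kept for history but marked inactive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (
    id             INTEGER PRIMARY KEY,
    spreadsheet_id INTEGER NOT NULL,
    employee_id    TEXT NOT NULL,
    name           TEXT,
    office         TEXT,
    status         TEXT NOT NULL DEFAULT 'active'
);
""")

def sync_rows(conn, new_sheet_id, rows):
    """rows: iterable of (employee_id, name, office) from the new sheet."""
    cur = conn.cursor()
    for emp_id, name, office in rows:
        cur.execute("SELECT id, name, office FROM data "
                    "WHERE employee_id = ? AND status = 'active'", (emp_id,))
        current = cur.fetchone()
        if current and (current[1], current[2]) == (name, office):
            continue  # unchanged: leave the existing row alone
        if current:
            # keep the original row, but retire it
            cur.execute("UPDATE data SET status = 'inactive' WHERE id = ?",
                        (current[0],))
        cur.execute("INSERT INTO data (spreadsheet_id, employee_id, name, office) "
                    "VALUES (?, ?, ?, ?)", (new_sheet_id, emp_id, name, office))
    conn.commit()

# Original upload (spreadsheet 1), then the modified version (spreadsheet 2)
# in which Bob's office changed and Carol was appended.
sync_rows(conn, 1, [("e1", "Alice", "London"), ("e2", "Bob", "Paris")])
sync_rows(conn, 2, [("e1", "Alice", "London"),
                    ("e2", "Bob", "Berlin"),
                    ("e3", "Carol", "Madrid")])
```

After the second sync, the unchanged row (Alice) is untouched, Bob's old row survives as inactive history, and only the two changed/new rows are stored under spreadsheet 2.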

I'm hoping to come up with a fourth and better solution. Any ideas as to what that might be?

+1  A: 

If you have no way to be 100% certain you can avoid human error in option 2, don't do it.

Option 3: It should not be too difficult (or time-consuming) to write a VBA script that does the comparison for you. VBA is not fast, but you can let it run overnight. It should not take more than an hour or two to get it running error-free.

Option 1: This would be my preferred approach: fast, simple, and I can't think of anything that could go wrong right now. (Though IMO you should first mark the original as 'inactive', then upload the new data set.) Especially if this can happen more often in the future, having a stable and fast process to deal with it is important.
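The mark-inactive-then-upload ordering for option 1 can be sketched like this (assumed schema and column names; the key point is doing both steps in one transaction, so a failed upload rolls back instead of leaving no active copy):

```python
import sqlite3

# Sketch of option 1 with the suggested ordering: retire the old
# spreadsheet, then load the replacement, inside a single transaction.
# Schema and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spreadsheets (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'active'
);
CREATE TABLE data (
    id             INTEGER PRIMARY KEY,
    spreadsheet_id INTEGER NOT NULL REFERENCES spreadsheets (id),
    employee_name  TEXT
);
""")

def replace_spreadsheet(conn, old_sheet_id, filename, names):
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE spreadsheets SET status = 'inactive' "
                     "WHERE id = ?", (old_sheet_id,))
        new_id = conn.execute(
            "INSERT INTO spreadsheets (filename) VALUES (?)",
            (filename,)).lastrowid
        conn.executemany(
            "INSERT INTO data (spreadsheet_id, employee_name) VALUES (?, ?)",
            [(new_id, name) for name in names])
        return new_id

# Initial upload, then the modified version of the same sheet.
first = replace_spreadsheet(conn, None, "london.xls", ["Alice", "Bob"])
second = replace_spreadsheet(conn, first, "london_v2.xls",
                             ["Alice", "Bob", "Carol"])
```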

If you are really worried about all the inactive entries, you can also delete them after your update (DELETE FROM spreadsheets WHERE status = 'inactive', or some such). But every database I have seen in my work has had lots of those, so I wouldn't worry too much about it.

Treb