Before I describe my problem, I'd like to get a couple things out of the way:
- I'm an experienced (though not expert) database designer. I believe I have a good grasp of the relational model.
- I don't have such a firm understanding of the relational model that I know exactly what to do in every situation. I'm still learning.
Let's say we get an Excel spreadsheet once a month from a bank, but not always the same bank. The spreadsheet has just six columns: bank name, account number, account balance, customer (accountholder) name, customer SSN and accountholder address. Each row has a different account number and no account number is listed in more than one row. We want to import this spreadsheet into a database and, at any time in the future, say, "What was John Smith's address on October 13, 2010?"
For simplicity, let's say that every customer only has one address and that every customer can have zero or more accounts. And just for a second, let's pretend that we only have to do one Excel sheet import EVER, which is a silly premise, but bear with me. If that's the case, the following design would suffice:
bank
--------
id
name
account
--------
id
bank_id
customer_id
number
balance
customer
--------
id
name
ssn
address
city
state_id
zip
state
--------
id
name
The rest of my question is based on the premise that you agree that that schema is "correct", so hopefully you're fine with it.
Now, that would be fine if we only ever did one import, but we'll be doing 12 imports per bank per year. Here's how I was thinking of accounting for that:
bank
--------
id
name
account
--------
id
import_id
bank_id
customer_id
number
balance
customer
--------
id
name
ssn
address
city
state_id
zip
state
--------
id
name
import
--------
id
date
excel_file (blob)
Now every account is tied to an import and we can say with certainty things like "Account 12345 came from import 572 on 10/13/10." It gets potentially a little more ambiguous when you look at, say, the customer
table. Since there are less rows in the customer
table than in the account
table (because some customers have multiple accounts), we don't have that one-to-one relationship between customers and imports like we do for accounts and imports. I know there's no data loss and there's no loss of data integrity, but it still feels like some sort of sacrifice somehow.
My question is (and this may be too open-ended): Do you think this is a good way to store the data? Would you have done it differently?
Edit: there's an important way of thinking about these entities that you have to be aware of. Don't think of an account
as one account that exists over time. Think of an account
as a snapshot of an account at a certain point in time. Therefore, account 12345 with balance $100 is NOT the same account
as account 12345 with balance $150. Yes, both records are tied to the same bank account in the real world, but what I'm storing is a snapshot of the account at a certain point in time. Similar (but not identical) situation with customers.