Before I describe my problem, I'd like to get a couple things out of the way:

  1. I'm an experienced (though not expert) database designer. I believe I have a good grasp of the relational model.
  2. I don't have such a firm understanding of the relational model that I know exactly what to do in every situation. I'm still learning.

Let's say we get an Excel spreadsheet once a month from a bank, but not always the same bank. The spreadsheet has just six columns: bank name, account number, account balance, customer (accountholder) name, customer SSN and accountholder address. Each row has a different account number and no account number is listed in more than one row. We want to import this spreadsheet into a database and, at any time in the future, say, "What was John Smith's address on October 13, 2010?"

For simplicity, let's say that every customer only has one address and that every customer can have zero or more accounts. And just for a second, let's pretend that we only have to do one Excel sheet import EVER, which is a silly premise, but bear with me. If that's the case, the following design would suffice:

bank
--------
id
name

account
--------
id
bank_id
customer_id
number
balance

customer
--------
id
name
ssn
address
city
state_id
zip

state
--------
id
name

The rest of my question is based on the premise that you agree that that schema is "correct", so hopefully you're fine with it.

Now, that would be fine if we only ever did one import, but we'll be doing 12 imports per bank per year. Here's how I was thinking of accounting for that:

bank
--------
id
name

account
--------
id
import_id
bank_id
customer_id
number
balance

customer
--------
id
name
ssn
address
city
state_id
zip

state
--------
id
name

import
--------
id
date
excel_file (blob)

Now every account is tied to an import and we can say with certainty things like "Account 12345 came from import 572 on 10/13/10." It gets potentially a little more ambiguous when you look at, say, the customer table. Since there are fewer rows in the customer table than in the account table (because some customers have multiple accounts), we don't have the one-to-one relationship between customers and imports that we have between accounts and imports. I know there's no data loss and no loss of data integrity, but it still feels like some sort of sacrifice.

My question is (and this may be too open-ended): Do you think this is a good way to store the data? Would you have done it differently?

Edit: there's an important way of thinking about these entities that you have to be aware of. Don't think of an account as one account that exists over time. Think of an account as a snapshot of an account at a certain point in time. Therefore, account 12345 with balance $100 is NOT the same account as account 12345 with balance $150. Yes, both records are tied to the same bank account in the real world, but what I'm storing is a snapshot of the account at a certain point in time. Similar (but not identical) situation with customers.
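To make the snapshot idea concrete, here is a minimal in-memory sketch using Python's sqlite3. Table and column names follow the schema in the question; the sample data (dates, the SSN, the addresses) is invented for illustration. Since customer rows are snapshots reachable only through account and import, "John Smith's address on October 13, 2010" becomes "the newest snapshot on or before that date, matched by SSN":

```python
import sqlite3

# In-memory sketch of the snapshot model from the question.
# bank and state are omitted; they aren't needed to show the as-of query.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE import   (id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT,
                       address TEXT, city TEXT, state_id INTEGER, zip TEXT);
CREATE TABLE account  (id INTEGER PRIMARY KEY, import_id INTEGER,
                       bank_id INTEGER, customer_id INTEGER,
                       number TEXT, balance REAL);
""")
# Two imports; John Smith moves in between, so he gets two snapshot rows.
db.execute("INSERT INTO import VALUES (1, '2010-09-13'), (2, '2010-10-13')")
db.execute("INSERT INTO customer VALUES "
           "(1, 'John Smith', '123-45-6789', '123 Fake St', 'Springfield', 1, '11111'),"
           "(2, 'John Smith', '123-45-6789', '456 New St',  'Springfield', 1, '11111')")
db.execute("INSERT INTO account VALUES "
           "(1, 1, 1, 1, '12345', 100.0),"
           "(2, 2, 1, 2, '12345', 150.0)")

# "What was John Smith's address on 2010-10-13?" -- the latest snapshot
# on or before that date, identified by SSN since snapshots share it.
row = db.execute("""
    SELECT c.address
    FROM customer c
    JOIN account a ON a.customer_id = c.id
    JOIN import  i ON a.import_id   = i.id
    WHERE c.ssn = '123-45-6789' AND i.date <= '2010-10-13'
    ORDER BY i.date DESC LIMIT 1
""").fetchone()
print(row[0])  # 456 New St
```

Note that the customer snapshot only gets a date indirectly, via an account row; a customer who appeared in an import with no accounts would be invisible to this query, which is the ambiguity discussed above.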

A: 
  1. As each import is tied to a specific bank, I might consider putting bank_id in the import table and leaving it out of the account table.
  2. If you want to account for historical address data, and you are getting that data exclusively from your imports, you might add the address fields to the account table and remove them from the customer table. Granted, doing so can lead to duplication when you have the same address across multiple imports. If you care a lot about that, you can add another table, maybe "address", likely with a compound primary key of customer_id and address_id. Then your import table adds the address_id field, and your import code would need to check whether the address already existed.
Andrew
1. That's probably a good idea. 2. Since each customer has exactly one address, I don't see a reason to separate the customer fields and address fields into two tables.
Jason Swett
@Jason Swett But, you state: "What was John Smith's address on October 13, 2010?" as a problem to solve. If the address never changes, how would you determine that?
Andrew
Good point. Instead of thinking of each customer record as an actual customer, think of a customer as a snapshot of what was true about that customer at a particular point in time. Same with accounts. (I've edited my post with some thoughts on this at the bottom.) If you think about it that way, John Smith would have 2 customer records if he lived at 2 different addresses.
Jason Swett
@Jason Swett True, but now you have a different problem. You know that some John Smith had some particular address on October 13, 2010. You don't know if a later or earlier John Smith is the same John Smith. Granted, you probably need a little more information in the import to make that determination, but in a way, this issue almost makes the historical data question moot. You're no longer tracking anything in particular over time.
Andrew
The customer will always have the same SSN (I didn't have SSN in my original question). Plus, the customer tied to account number X in 2009 is probably the same customer that's tied to account number X in 2010.
Jason Swett
A: 

In general, the design looks good to me.

Does the import/import_id itself have any meaning beyond storing a date? If not, I see no reason not to drop the table altogether and put an import_date column on the account table.

Also, if you want historical address information, you're going to need the import_id (or import_date :) ) on the customer table as well.

Update

As noted in the comment, adding the import_id would not account for historical address data.

What you would need is some sort of customer_history table, which would store any of the data that could change, and would link back to the customer table via foreign key.

customer
------
id
first_name
last_name (assuming name wouldn't change--it certainly could)


customer_history
-----------------
id
customer_id
import_id (or date)
(address fields)

If the details of the account could change over time, you would need a history table for that as well.
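A minimal sqlite3 sketch of the customer/customer_history split above; column names follow the listing, the sample data is invented. The stable identity lives in customer, and anything that can change is versioned in customer_history with a foreign key back:

```python
import sqlite3

# Sketch of the customer / customer_history split: identity in one table,
# changeable attributes versioned in the other.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (
    id         INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT
);
CREATE TABLE customer_history (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    import_date TEXT,            -- or import_id, per the answer
    address     TEXT
);
""")
db.execute("INSERT INTO customer VALUES (1, 'John', 'Smith')")
db.execute("""INSERT INTO customer_history (customer_id, import_date, address)
              VALUES (1, '2010-09-13', '123 Fake St'),
                     (1, '2010-10-13', '456 New St')""")

# Address as of a given date: the newest history row on or before it.
addr = db.execute("""
    SELECT address FROM customer_history
    WHERE customer_id = 1 AND import_date <= '2010-10-13'
    ORDER BY import_date DESC LIMIT 1
""").fetchone()[0]
print(addr)  # 456 New St
```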

Phil Sandler
Good catch. I left out the fact that I want to store the original Excel file in the `import` table. I've edited my question to fix this. Otherwise, you're right, there's no reason to keep the `import` table separate.
Jason Swett
I think you're good then--I see no problem with this design.
Phil Sandler
Also, if John Smith lives at 123 Fake St in import 100, then lives at 456 New St in import 101, he would get a new customer record. Same if he were to change his name. I think that accounts for historical data.
Jason Swett
You're right--updated.
Phil Sandler
A: 

I would be cautious of the idea that a customer has only one address (this is not true in my real-life experience). You will either need to store the most recent address you get, by updating the customer on each load, or you should consider splitting the address off into a new table linked to the customer, perhaps with begin and end dates to show when you believed that address was valid.

I also think I would not put import_id on the account. If you do, you will get lots of rows (x12 per year) for each customer-to-bank connection. Not what you want, I think. Instead, you could add an account-to-import link table to record that this account was listed in one or more of these imports.

Randy
I'm going to comment separately for each issue. First, you're right that customers can have more than one address. I had said "For simplicity, let's say that every customer only has one address." In real life, customers can have many addresses, but that would make my question needlessly complicated. There's a lot more going on that I've left out.
Jason Swett
Since each account comes from only one import, I don't see why it would make sense to add an account-to-import table. If I want to see which bank(s) a customer banks with, all I have to do is `SELECT b.name FROM customer c JOIN account a ON a.customer_id = c.id JOIN bank b ON a.bank_id = b.id WHERE c.name = 'John Smith'`. I don't see what you mean about lots of rows for each customer-to-bank connection.
Jason Swett
OK, I made an assumption that you were getting a dump of accounts from each bank each month, which would have duplicates. If the data is already pre-filtered to only new accounts, then you should not expect duplicates. Still, I would be concerned about the what-if scenario where you do receive more than one, and also: what happens when an account is closed?
Randy
Ah, I see. Under that assumption, what you said would make perfect sense. When an account is closed, we would just stop creating new records for it. (Although it may make sense to somehow store closing dates of accounts so we can say "this account was closed on 10/13/10" instead of "this account was closed between import 422 and import 423.")
Jason Swett
A: 

I'm sorry, I can't reconcile the statements "each customer has only one address" and "we want to say 'What was John Smith's address on October 13th, 2010'". Are you suggesting that on each import, you'll create a new customer record for each person found in the import? If so, how will you know that John Smith in one import is the same John Smith from another import if the account numbers are different?

And if you reuse the same customer record for the same customer (which seems correct to me), where do you find prior address information?

[After comments and amendments by the poster]

Okay, you're almost there. You do need to add the customer address to the Account table (which should really be renamed AccountImports or something like that). That's because each import might have a different address.

Storing the address in AccountImports is slightly denormalized if the address frequently stays the same from import to import. If so, you can add a CustomerAddressHistory table. During each import, check the latest address for the SSN in CustomerAddressHistory and, if it doesn't match the import, add the new address as a new record in that table.
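The check-before-insert step during import might look like the sqlite3 sketch below. The table name, column names, and helper function are illustrative, not from the thread; the point is simply to store an address row only when it differs from the latest one on file for that SSN:

```python
import sqlite3

# Sketch of the de-duplicating import step: write an address row only when
# it differs from the newest one already stored for that SSN.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE customer_address_history
              (ssn TEXT, import_date TEXT, address TEXT)""")

def record_address(ssn, import_date, address):
    # Hypothetical helper called once per row of an import.
    latest = db.execute("""
        SELECT address FROM customer_address_history
        WHERE ssn = ? ORDER BY import_date DESC LIMIT 1
    """, (ssn,)).fetchone()
    if latest is None or latest[0] != address:
        db.execute("INSERT INTO customer_address_history VALUES (?, ?, ?)",
                   (ssn, import_date, address))

record_address('123-45-6789', '2010-09-13', '123 Fake St')
record_address('123-45-6789', '2010-10-13', '123 Fake St')  # unchanged: skipped
record_address('123-45-6789', '2010-11-13', '456 New St')   # changed: stored
rows = db.execute("SELECT COUNT(*) FROM customer_address_history").fetchone()[0]
print(rows)  # 2
```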

Larry Lustig
I see exactly where you're coming from. Instead of thinking of each customer record as an actual customer, think of a customer as a snapshot of what was true about that customer at a particular point in time. Same with accounts. (I've edited my post with some thoughts on this at the bottom.)
Jason Swett
That's a very good question concerning how I would tell that John Smith from one import is the same as John Smith from another. I will have to think about that.
Jason Swett
Ideally, you'd have two tables: one that represents the customer and another that represents an address-in-time datum.
Larry Lustig
I can buy the idea that John Smith has only one address (at a time, I assume you mean). I _can't_ believe there's only one John Smith, however.
Larry Lustig
Ah, yes: every customer "snapshot" will always have the same social security number as every other snapshot that represents the same customer. (I've updated my post slightly to reflect this.)
Jason Swett
Lol @ the John Smith banter, I use/see that example all the time :)
pheadbaq
A: 

No clue what DB you're using, but here goes: I would NOT store the import as a blob, as it impedes your ability to link it with your existing data; you have to parse the blob as whatever type of file you expect it to be before you can join it with any of your other data. Import the data directly into your import table along with the id and date fields you already have. Put a key on id, then a unique compound index on date, bank, and account to prevent same-date duplicates.

If you know for certain you will only ever have 12 imports a year (one per month, I presume?), you could increase integrity by creating two calculated fields, one for date_month (to hold just the month) and one for date_year (to hold just the year), and then creating a unique compound index on bank id, account, date_month, and date_year. This would prevent accidental re-imports of the same month's data on different dates, e.g. if the import for October were done on Monday and someone did it again on Tuesday. It would also prevent "oops, I clicked the button again" and "oops, I imported this month's data as last month's" scenarios. To speed up checks on the calculated fields, put ordinary (non-unique) indexes on date_month and date_year.
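As a sketch of the one-import-per-month guard, here is an sqlite3 version. SQLite has no calculated columns of that sort, so this uses an expression index (available since SQLite 3.9) on the year-month part of the date instead; the table and index names are illustrative:

```python
import sqlite3

# Sketch of a "one import per bank/account/month" constraint, expressed as
# a unique expression index on the year-month portion of the date.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE import_row
              (bank_id INTEGER, account TEXT, date TEXT, balance REAL)""")
db.execute("""CREATE UNIQUE INDEX one_per_month
              ON import_row (bank_id, account, strftime('%Y-%m', date))""")

db.execute("INSERT INTO import_row VALUES (1, '12345', '2010-10-11', 100.0)")
# Re-importing October's data on a different day now fails loudly.
try:
    db.execute("INSERT INTO import_row VALUES (1, '12345', '2010-10-12', 100.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

In databases with computed/persisted columns (SQL Server, MySQL 5.7+, etc.) the same guard can be a unique index over the calculated month and year columns, as the answer describes.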

If you want your customer table to always reflect the current address without any fuss, make address a calculated field that does a lookup into your import table by customer account (or SSN or etc) and picks out the TOP 1 address sorted by date DESC. If you want queries on or including the address field to be faster, put an index on it.

pheadbaq
Those are all good points. The blob is only for accountability, though, not for looking up any data.
Jason Swett
I see, so for any month you'd have x blobs, x being the number of banks that should have an import done. If that's the case, importing the data directly gives you your accountability (either way, you can confirm the import by doing a lookup on the import table for bank id and date), plus historical data (most recent accounts, balances, names, and addresses). If you had a digitally signed file or something, I could see storing the blob for accountability, but otherwise I'd be looking at how to easily make the data available to a query.
pheadbaq
Might be off topic, but since we're talking historical data, if you end up wanting to data mine the info in those blobs, then sooner or later the data in those x files for x months for x years will all have to be extracted, scrubbed, and consolidated. If the banks end up changing formats over the years, you'll have a nice data-scrubbing nightmare on your hands. Just saying :)
pheadbaq
A: 

I would make a new table called CustomerAddress and move the address information out of customer into this new table.

Then, on the Account table and the CustomerAddress table, add two new columns: StartDate and EndDate.

That way you get to keep a single row per customer over time and can easily track each customer's accounts and addresses over time. It gets too confusing if you try to keep multiple copies of a customer.
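A small sqlite3 sketch of this single-customer-row design, with invented sample data; an open-ended current address is modeled here with a NULL EndDate. "Address on date X" becomes "the interval containing X":

```python
import sqlite3

# Sketch of the CustomerAddress design: one row per customer, address
# changes tracked as validity intervals (NULL end_date = still current).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT);
CREATE TABLE customer_address (
    customer_id INTEGER REFERENCES customer(id),
    address     TEXT,
    start_date  TEXT,
    end_date    TEXT             -- NULL = still current
);
""")
db.execute("INSERT INTO customer VALUES (1, 'John Smith', '123-45-6789')")
db.execute("""INSERT INTO customer_address VALUES
    (1, '123 Fake St', '2010-01-01', '2010-10-01'),
    (1, '456 New St',  '2010-10-01', NULL)""")

# Address valid on a given date: the interval that contains it.
addr = db.execute("""
    SELECT address FROM customer_address
    WHERE customer_id = 1
      AND start_date <= '2010-10-13'
      AND (end_date IS NULL OR end_date > '2010-10-13')
""").fetchone()[0]
print(addr)  # 456 New St
```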

Daveo
A: 

There were a lot of good points here but nothing that was really an answer that I was 100% satisfied with, probably because it was a badly-formed question. In order to "improve" my accept rate, I'm "answering" my own question.

Jason Swett