ansaurus

Question

Answer 1

A:

As each import is tied to a specific bank, I might consider putting bank_id in the import table and leaving it out of the account table.
If you are wanting to account for historical address data, and you are getting that data exclusively from your imports, you might add the address fields to the account table and remove them from the customer table. Granted, doing so can lead to duplication when you have the same address for multiple imports. If you care a lot about that, you can add another table, maybe "address", likely with a compound primary key of customer_id and address_id. Then, your import table adds the address_id field, and your import code would need to check whether the address already existed.

Andrew 2010-10-13 19:00:10

1. That's probably a good idea. 2. Since each customer has exactly one address, I don't see a reason to separate the customer fields and address fields into two tables.

Jason Swett 2010-10-13 19:12:23

@Jason Swett But, you state: "What was John Smith's address on October 13, 2010?" as a problem to solve. If the address never changes, how would you determine that?

Andrew 2010-10-13 19:52:38

Good point. Instead of thinking of each customer record as an actual customer, think of a customer as a snapshot of what was true about that customer at a particular point in time. Same with accounts. (I've edited my post with some thoughts on this at the bottom.) If you think about it that way, John Smith would have 2 customer records if he lived at 2 different addresses.

Jason Swett 2010-10-13 19:56:59

@Jason Swett True, but now you have a different problem. You know that some John Smith had some particular address on October 13, 2010. You don't know if a later or earlier John Smith is the same John Smith. Granted, you probably need a little more information in the import to make that determination, but in a way, this issue almost makes the historical data question moot. You're no longer tracking anything in particular over time.

Andrew 2010-10-13 21:04:48

The customer will always have the same SSN (I didn't have SSN in my original question). Plus, the customer tied to account number X in 2009 is probably the same customer that's tied to account number X in 2010.

Jason Swett 2010-10-13 21:10:14

Answer 2

A:

In general, the design looks good to me.

Does the import/import_id itself have any meaning outside of storing a date? If not, I see no reason why you shouldn't exclude the table altogether and put an import_date on the account table.

Also, if you want historical address information, you're going to need the import_id (or import_date :) ) on the customer table as well.

Update

As noted in the comment, adding the import_id would not account for historical address data.

What you would need is some sort of customer_history table, which would store any of the data that could change, and would link back to the customer table via foreign key.

customer
------
id
first_name
last_name (assuming name wouldn't change--it certainly could)


customer_history
-----------------
id
customer_id
import_id (or date)
(address fields)

If the details of the account could change over time, you would need a history table for that as well.

Phil Sandler 2010-10-13 19:16:45

Good catch. I left out the fact that I want to store the original Excel file in the `import` table. I've edited my question to fix this. Otherwise, you're right, there's no reason to keep the `import` table separate.

Jason Swett 2010-10-13 19:20:45

I think you're good then--I see no problem with this design.

Phil Sandler 2010-10-13 19:22:35

Also, if John Smith lives at 123 Fake St in import 100, then lives at 456 New St in import 101, he would get a new customer record. Same if he were to change his name. I think that accounts for historical data.

Jason Swett 2010-10-13 19:26:48

You're right--updated.

Phil Sandler 2010-10-13 19:45:43

Answer 3

A:

i would be cautious of the idea that a customer has only one address. (this is not true in my real life experience). You will either need to store the most recent address you get by updating the customer on each load, or you should consider splitting off the address to a new table and linking it to the customer - perhaps with begin and end dates to show you when you thought that address was valid.

I think also i would not put import_id on the account. If you do this, you will get lots of rows (x12) for each customer-to-bank connection. Not what you want, i think. instead, you could put an account-to-import link table to let you know that this account was listed in one or more of these imports.

Randy 2010-10-13 19:27:22

I'm going to comment separately for each issue. First, you're right that customers can have more than one address. I had said "For simplicity, let's say that every customer only has one address." In real life, customers can have many addresses, but that would make my question needlessly complicated. There's a lot more going on that I've left out.

Jason Swett 2010-10-13 19:32:14

Since each account comes from only one import, I don't see why it would make sense to add an account-to-import table. If I want to see which bank(s) a customer banks with, all I have to do is `SELECT b.name FROM customer c JOIN account a ON a.customer_id = c.id JOIN bank b ON a.bank_id = b.id WHERE c.name = 'John Smith'`. I don't see what you mean about lots of rows for each customer-to-bank connection.

Jason Swett 2010-10-13 19:37:41

ok - i made an assumption that you were getting a dump of accounts form each bank each month - which would have duplicates. If this is already pre-filtered to only new accounts, then you should not expect duplicates. Still, i would be concerned in the what-if scenario when you do receive more than one, and also, what happens when an account is closed?

Randy 2010-10-13 19:53:07

Ah, I see. Under that assumption, what you said would make perfect sense. When an account is closed, we would just stop creating new records for it. (Although it may make sense to somehow store closing dates of accounts so we can say "this account was closed on 10/13/10" instead of "this account was closed between import 422 and import 423.")

Jason Swett 2010-10-13 19:58:43

Answer 4

+1 A:

I'm sorry, I can't reconcile the statements "each customer has only one address" and "we want to say 'What was John Smith's address on October 13th, 2010'". Are you suggesting that on each import, you'll create a new customer record for each person found in the import? If so, how will you know that John Smith in one import is the same John Smith from another import if the account numbers are different?

And if you reuse the same customer record for the same customer (which seems correct for me) where to you find prior address information?

[After comments and amendments by the poster]

Okay, you're almost there. You do need to add the customer address to the Account table (which should really be renamed AccountImports or something like that). That's because each import might have a different address.

Storing the address in AccountImports is slightly un-normal if the address frequently stays the same from import to import. If so, you can add a CustomerAddressHistory table. During each import, check the latest address for the SSN in CustomerAddressHistory and, if not the same as the import, add the new address to a new record in that table.

Larry Lustig 2010-10-13 19:39:00

I see exactly where you're coming from. Instead of thinking of each customer record as an actual customer, think of a customer as a snapshot of what was true about that customer at a particular point in time. Same with accounts. (I've edited my post with some thoughts on this at the bottom.)

Jason Swett 2010-10-13 19:52:36

That's a very good question concerning how I would tell that John Smith from one import is the same as John Smith from another. I will have to think about that.

Jason Swett 2010-10-13 19:53:31

Ideally, you'd have two tables — one the represented the customer and another that represented an address-in-time datum.

Larry Lustig 2010-10-13 19:56:05

I can buy the idea that John Smith has only one address (at a time, I assume you mean). I _can't_ believe there's only one John Smith, however.

Larry Lustig 2010-10-13 19:56:35

Ah, yes: every customer "snapshot" will always have the same social security number as every other snapshot that represents the same customer. (I've updated my post slightly to reflect this.)

Jason Swett 2010-10-13 20:01:07

Lol @ the John Smith banter, I use/see that example all the time :)

pheadbaq 2010-10-13 20:57:24

Answer 5

A:

No clue what DB you're using, but here goes: I would NOT have stored the import as a blob, as it impedes your ability to link with your existing data because you have to process the blob as the type of file you expect it to be before you can join it with any of your other data. Import the data directly into your import table along w/ the id and date field you already have. Put a key on id, then a unique compound index on date, bank, and account to prevent same-date dups.

If you know for certain you will only ever have 12 imports a year (months, I presume?), you could increase integrity by creating two calculated fields, one for date_month (to hold JUST the month), and one for date_year (to hold JUST the year), and then create a unique compound index on bank id, account, date_month, and date_year. This would prevent accidental re-imports for the same month's data on different dates, like if the import for Oct was done on Mon, then someone did it again on Tues. It would also prevent "oops I clicked the button again" or "oops, I imported this month's data as last month's" scenarios. To speed up the checks on the calculated fields, put unique indexes on date_month and date_year.

If you want your customer table to always reflect the current address without any fuss, make address a calculated field that does a lookup into your import table by customer account (or SSN or etc) and picks out the TOP 1 address sorted by date DESC. If you want queries on or including the address field to be faster, put an index on it.

pheadbaq 2010-10-13 20:27:42

Those are all good points. The blob is only for accountability, though, not for looking up any data.

Jason Swett 2010-10-13 20:35:57

I see, so for any month, you'd have x blobs... x being the number of banks that should have an import done. If that's the case, importing the data directly gives you your accountability (either way, you can confirm the import by doing a lookup on the import table for bank id and date), plus historical data (most recent accounts, balances, names, and addresses).If you had a digitally signed file or something, I could see storing the blob for accountability, but otherwise I'd be looking at how to easily make the data available to a query.

pheadbaq 2010-10-13 20:48:55

Might be off topic, but since we're talking historical data, if you end up wanting to data mine the info in those blobs, then sooner or later the data in those x files for x months for x years will all have to be extracted, scrubbed, and consolidated. If the banks end up changing formats over the years, you'll have a nice data-scrubbing nightmare on your hands. Just saying :)

pheadbaq 2010-10-13 21:05:16

Answer 6

A:

I would make a new table Called CustomerAddress and move the address informaiton out of customer into this new table

Then on the Account table and CustomerAddress table add 2 new columns StartDate and EndDate

That way you get to keep a single row of a customer overtime and can easily track each of the customers accounts and Addresses overtime. it gets too confusing if you try to keep mutiple copies of a customer.

Daveo 2010-10-13 22:59:37

Answer 7

A:

There were a lot of good points here but nothing that was really an answer that I was 100% satisfied with, probably because it was a badly-formed question. In order to "improve" my accept rate, I'm "answering" my own question.

Jason Swett 2010-10-29 12:57:56

ansaurus

tags:

views:

answers:

Normalization and historical data

related questions