I am trying to design a data warehouse for a state government where data about individuals may arrive from multiple agencies. These individuals may or may not have SSNs or tax IDs -- think homeless people who are entitled to Medicaid, or data about incarcerated people from the department of corrections. As the data arrives, the system creates a surrogate key for the individual and associates it with whatever identifying information -- DOB, gender, name -- may be available. However, over time it may turn out that two surrogate keys created from data supplied by two separate departments actually represent the same person, or that a single surrogate key is actually representing twins. In such cases the keys may have to be merged or divided -- possibly retroactively.

There are obviously (bi)temporal aspects to consider here, since the information recorded at transaction time t1 for valid time v1 may differ from the information recorded at transaction time t2 for the same valid time v1.
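To make the bitemporal part concrete, here is a minimal sketch of what a key-assignment row might look like with both timelines tracked; all of the field names are illustrative, not part of any established design:

    from dataclasses import dataclass
    from datetime import date

    # Illustrative bitemporal key-assignment row: the same valid-time
    # interval can carry different facts depending on when (transaction
    # time) the warehouse learned about them.
    @dataclass(frozen=True)
    class KeyAssignment:
        source_record_id: str  # agency record being classified
        surrogate_key: int     # person the record is believed to identify
        valid_from: date       # when the assignment holds in the real world
        valid_to: date         # end of validity (open-ended = date.max)
        tx_from: date          # when the warehouse recorded this belief
        tx_to: date            # when this belief was superseded (open = date.max)

Under this scheme a retroactive merge never deletes anything: it closes tx_to on the old belief and inserts a new row with the corrected surrogate key, so "what did we believe at t1 about v1?" stays answerable.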

My current design loads the identifying data into a graph, with the individuals identified as nodes. Edges are created when reliable data links two nodes. The graph is built using the data available at transaction time t1. Connected components are extracted and assigned surrogate keys. The idea is to build a similar graph at transaction time t2 and then map the evolution of surrogate keys from t1 to t2. This is where I need to bring in the merge/divide issues.
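A minimal sketch of that step, assuming the graph fits in memory and using union-find for the components (the function names are mine, not from the question):

    # Extract connected components with union-find, then compare the
    # t1 and t2 partitions to detect merges and splits.

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def components(nodes, edges):
        parent = {n: n for n in nodes}
        for a, b in edges:
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[ra] = rb
        groups = {}
        for n in nodes:
            groups.setdefault(find(parent, n), set()).add(n)
        return list(groups.values())

    def key_evolution(parts_t1, parts_t2):
        """For each t2 component, list the t1 components it overlaps."""
        evolution = []
        for comp2 in parts_t2:
            ancestors = [i for i, comp1 in enumerate(parts_t1) if comp1 & comp2]
            evolution.append((comp2, ancestors))
        return evolution

A t2 component overlapping two t1 components signals a merge; a t1 component that feeds two t2 components signals a split. Either way, the surrogate-key history then has to be corrected retroactively.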

I am not looking for an exact solution from the SO community, but rather pointers to literature or published designs for such a scenario. BTW, the same scenario repeats if one handles financial data, as companies merge and spin off over time and different agencies record the transactions differently at different times.

Thanks.

A: 

Someone on the business side is going to have to manually make deduplication (easy) or unmerging (much harder) decisions about the assigned keys. If you want to incorporate that into the initial design, you can, but you'll make the database a lot harder to work with.
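One way to incorporate those decisions without complicating the core tables might be an append-only decision log that the reporting layer can choose to apply; a rough sketch, with every name invented for illustration:

    from dataclasses import dataclass
    from datetime import datetime

    # Illustrative decision log: merges and unmerges are appended as
    # business-side rulings rather than destructive updates.
    @dataclass(frozen=True)
    class KeyDecision:
        decided_at: datetime
        action: str        # "merge" or "unmerge"
        loser_key: int     # key being retired or carved out
        survivor_key: int  # key the records are reassigned to
        decided_by: str    # the human who made the call

    def current_key(key, decisions):
        """Follow merge rulings to the surviving surrogate key.
        Assumes the merge decisions contain no cycles."""
        remap = {d.loser_key: d.survivor_key
                 for d in decisions if d.action == "merge"}
        while key in remap:
            key = remap[key]
        return key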

Right now, btw, I work for a state agency reporting data on people participating in service programs, and from a reporting point of view we treat one personID as one person -- even if it's really two people who need to be unmerged, or a person with a second, different personID who gets counted as a second person.

In other words, try to find a way to accept large amounts of error you cannot control.

It's hard enough to get a large complex system working if everything from every point in time is considered accurate.

Beth
+1  A: 

I'm working on a similar project. Merging retroactively isn't impossible -- merely difficult, and very, very expensive. I've handled it by setting up a UI where a user can state that all the records for X' are really for X, and a batch job as part of the nightly ETL process can sweep through the data, performing the appropriate updates to the fact tables. Yes, it's expensive, but since everything is indexed, I hope it's not completely brutal, and there won't be foreign key problems because records exist for both X and X'.

However, I don't see how you can split the records for X into X and X' without a detailed analysis of whose records are whose. You may no longer have sufficiently detailed data (I know I don't in my case) to determine which records would have belonged to which version of X without, for example, reloading all of the base data for X and X' -- which you may or may not have.

Adam Musch