I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.
In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC DEF
That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC NULL
2 ABC DEF
So the warehouse is a mess. There's much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate key to natural key mappings.
A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:
MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching
I need an easy to understand implementation (not optimally performing) of Edmond's paths, trees, and flowers algorithm. I don't have a formal math or CS background, and what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well written explanation that guides me to an implementation would be deeply appreciated.
EDIT:
A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local maxima corner.
In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.
Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.