I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance.
However, I am not sure how to store the relationships. I use databases all the time but have never come accross this situation and wondered if someone could point me in the right direction.
What confuses me is how to store the bidirectional nature of the relationship.
I've started to put some examples below, but wondered if there is a best practice for storing this type of data,
Example data
id, address
001, 5 Main Street
002, 5 Main St.
003, 5 Main Str
004, 6 High Street
005, 7 Low Street
006, 7 Low St
Suggestion 1
customer_id1, customer_id2, relationship_strength
001, 002, 0.74
001, 003, 0.77
002, 003, 0.76
005, 006, 0.77
Not happy with this approach as it sort of infers a one way relationship between customer_id1 to customer_id2. Unless of course I include all relationships both ways, but that would double the amount of processing time and the size of the tables.
eg would need to include: 002, 001, 0.74
Suggestion 2
customer_id, grouping_id
001, 1
002, 1
003, 1
005, 2
006, 2