Hi all,
table a (t_a):
id name last first email state country
0 sklass klass steve [email protected] in uk
1 jabid abid john [email protected] ny us
2 jcolle colle john [email protected] wi us
table b (t_b):
id sn given nick email l c
0 steven klass steve [email protected] in uk
1 john abid - [email protected] ny us
2 johnny colle john [email protected] wi us
3 john abid - [email protected] ny us
What is listed above is an (abbreviated) column and row mySQL tables. Looking at the two tables it becomes pretty clear that by strictly looking at the values (id's not looked at) and comparing the matching number of values you would get this value matches.
t_a t_b
0 0
1 3
2 2
- 1
What I ultimately looking to do is to do this in django — I'm not sure if that matters. In the past I have done this using pure python in which I destroy the old data and just create three new tables. I want to shift away from my implementation (listed below) because the problems I see is that time changes things and people come and go. In the past I have just regenerated the data -- but now I want to keep track of when people leave and don't simply replace (delete) the data. I believe that by doing a SQL update is more elegant and preserves the history. So...
I'd like to know how to get this merged answer directly from mySQL (Either a SQL function or the construction of a new table) which merges the data in the following manner. I want to do this using pure SQL (I believe then I can do this in django). So I am looking for a solution which meets the following criteria:
- There is a min_match which defines the minimum number of matches between the two rows of which must be aligned to be considered valid.
- While the tables may have different lengths it is a 1-to-1 mapping. In other words many to one may not happen (yet)
Now my background is python and for me the simplest way to do this has always been to do a for loop over the shorter of the two tables, which then does a for loop over the other table. looking at the number of matches. In code this looks like this.
t_a = [ ["sklass", "klass", "steve", "[email protected]", "in", "uk", ],
["jabid", "abid", "john", "[email protected]", "ny", "us", ],
["jcolle", "colle", "john", "[email protected]", "wi", "us", ], ]
t_b = [ ["steven", "klass", "steve", "[email protected]", "in", "uk",],
["john", "abid", "[email protected]", "ny", "us",],
["johnny", "colle", "john", "[email protected]", "wi", "us",],
["john", "abid", "[email protected]", "ny", "us",], ]
min_match = 3
for person_a in t_a:
match = 0
match_pct = 0.0
match_a_index = t_a.index(person_a)
for person_b in t_b:
new_match_count = len(list(set(person_a) & set(person_b)))
if new_match_count > match:
match = new_match_count
match_b_index = t_b.index(person_b)
match_pct = "%.2f" % (float(new_match_count) / \
float(len(set(person_a + person_b))) * 100)
if match >= min_match:
print match_a_index, match_b_index #, match_pct, match
Updates - And more clarification
The comments beg the question why don't you just join on the email address. I don't necessarily know that the values in a column will match. I am certain that values from a given row in t_a will match values for a row in t_b. I want the highest (most probable) match for a given row in t_a to t_b and only if the number of matches is higher than min_match.
Thanks again