views:

176

answers:

1

Hi all,

table a (t_a):
id  name    last    first   email           state   country
0   sklass  klass   steve   [email protected]  in      uk
1   jabid   abid    john    [email protected]    ny      us
2   jcolle  colle   john    [email protected]  wi      us


table b (t_b):
id  sn      given   nick    email           l   c   
0   steven  klass   steve   [email protected]  in  uk
1   john    abid    -       [email protected]  ny  us
2   johnny  colle   john    [email protected]  wi  us
3   john    abid    -       [email protected]    ny  us

What is listed above is an (abbreviated) column and row mySQL tables. Looking at the two tables it becomes pretty clear that by strictly looking at the values (id's not looked at) and comparing the matching number of values you would get this value matches.

t_a     t_b
0       0
1       3
2       2
-       1

What I ultimately looking to do is to do this in django — I'm not sure if that matters. In the past I have done this using pure python in which I destroy the old data and just create three new tables. I want to shift away from my implementation (listed below) because the problems I see is that time changes things and people come and go. In the past I have just regenerated the data -- but now I want to keep track of when people leave and don't simply replace (delete) the data. I believe that by doing a SQL update is more elegant and preserves the history. So...

I'd like to know how to get this merged answer directly from mySQL (Either a SQL function or the construction of a new table) which merges the data in the following manner. I want to do this using pure SQL (I believe then I can do this in django). So I am looking for a solution which meets the following criteria:

  1. There is a min_match which defines the minimum number of matches between the two rows of which must be aligned to be considered valid.
  2. While the tables may have different lengths it is a 1-to-1 mapping. In other words many to one may not happen (yet)

Now my background is python and for me the simplest way to do this has always been to do a for loop over the shorter of the two tables, which then does a for loop over the other table. looking at the number of matches. In code this looks like this.

t_a = [ ["sklass", "klass", "steve", "[email protected]", "in", "uk", ],
        ["jabid", "abid", "john", "[email protected]", "ny", "us", ],
        ["jcolle", "colle", "john", "[email protected]", "wi", "us", ], ]

t_b = [ ["steven", "klass", "steve", "[email protected]", "in", "uk",],
        ["john", "abid", "[email protected]", "ny", "us",],
        ["johnny", "colle", "john", "[email protected]", "wi", "us",],
        ["john", "abid", "[email protected]", "ny", "us",], ]

min_match = 3

for person_a in t_a:
    match = 0
    match_pct = 0.0
    match_a_index = t_a.index(person_a)
    for person_b in t_b:
        new_match_count = len(list(set(person_a) & set(person_b)))
        if new_match_count > match:
            match = new_match_count
            match_b_index = t_b.index(person_b)
            match_pct = "%.2f" % (float(new_match_count) / \
              float(len(set(person_a + person_b))) * 100)
    if match >= min_match:
        print match_a_index, match_b_index #, match_pct, match

Updates - And more clarification

The comments beg the question why don't you just join on the email address. I don't necessarily know that the values in a column will match. I am certain that values from a given row in t_a will match values for a row in t_b. I want the highest (most probable) match for a given row in t_a to t_b and only if the number of matches is higher than min_match.

Thanks again

+1  A: 

You can do this in MySQL directly via a cursor executed through a stored procedure.

DELIMITER $$
CREATE PROCEDURE `proc_name`()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE a_id BIGINT UNSIGNED;
  DECLARE b_id BIGINT UNSIGNED;
  DECLARE x_count INT;

  -- something like the following
  DECLARE cur1 CURSOR FOR SELECT t_a.id, t_b.id FROM t_a, t_b WHERE t_a.email = t_b.email;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  SELECT COUNT(*) INTO x_count FROM t_a, t_b WHERE t_a.email = t_b.email;

  IF(x_count > <some_min_value>) THEN

    OPEN cur1;

    REPEAT
      FETCH cur1 INTO a_id, b_id;
      IF NOT done THEN

        -- do something here like update rows, remove rows, etc.
        -- a_id and b_id hold the two id values for the two tables which
        -- I assume to be primary keys

      END IF;
    UNTIL done END REPEAT;

    CLOSE cur1;

  END IF;
END
$$
tracy.brown
Hey Tracy,Great!! So this assumes that you have columns where the values would match. But I don't necessarily know what columns would match (Example a nickname in one column may match a a one or more columns in the other table.) So I think your close but is there a way to iteratively count the matching values?Thanks Again!!
rh0dium