ansaurus

Question

How Can I SELECT Similar Rows in Two Different Tables in MySQL (Is It Possible?)

Answer 1

+2 A:

You can try comparing the result of calling SOUNDEX.

Or you can compare the Levenshtein distance.

Mark Byers 2010-07-30 23:34:33

Answer 2

A:

This question is pretty broad. What qualifies as similar? Anyway, while I can't give you an exact answer, I can say that you'll probably want to use a fulltext search: http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html or use some technique using the LIKE wildcard operator: %

... LIKE "new%school"

will match "new school" and "new darn school" too.

Mike Sherov 2010-07-30 23:39:21

Hi MikeThank you for the suggestions. I'll see if I can get one of them working. It's not a very precise problem. And most of the items aren't very similar-- which makes it hard.However, some of the items in the tables are closely related. For example, a lot of entries are "blah EC" and in the other table it would be "blah ES". So, I'll start with the easier ones.Thanks.

Laxmidi 2010-07-31 00:25:51

Answer 3

A:

You can try this one

select test1.test1name,test2.test2name from test1  join test2 on
soundex(SUBSTRING_INDEX(test1.test1name, ' ',1))=
soundex(SUBSTRING_INDEX(test2.test2name, ' ',1))

just put the table name and field name according to your one.

2010-07-31 07:10:22

Answer 4

+2 A:

For a UDF implementation of the Levenshtein Distance algorithm you may want to check out "codejanitor.com: Levenshtein Distance as a MySQL Stored Function":

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END

Now let's build a test case, using the data you provided in your question:

CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));

INSERT INTO table_a VALUES('Olde School');      
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');

INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');

Then:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (a.name = b.name);

Obviously returns a match where the school names match exactly:

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | NULL        |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.00 sec)

Now we can try to use the LEVENSHTEIN function to return school names that have an edit distance of 2 characters or less:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | Old School  |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.08 sec)

Now using <= 3 as an edit distance threshold:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);

We get the following result:

+---------------------+--------------+
| name                | name         |
+---------------------+--------------+
| Olde School         | Old School   |
| Olde School         | Other School |
| New School          | Old School   |
| Other, C.S. School  | NULL         |
| Main School         | Main School  |
| Too Cool for School | NULL         |
+---------------------+--------------+
6 rows in set (0.06 sec)

Note how this time Olde School also matched Other School, and New School matched Old School as well. These are probably false positive, and shows that defining the threshold is very important to avoid incorrect matches.

One common technique to tackle this problem is to take into consideration the length of the strings when applying a threshold. In fact, the site that I cited for this implementation also provides a LEVENSHTEIN_RATIO function which returns the ratio (as a percentage) of the edit difference based on the length of the strings.

Daniel Vassallo 2010-07-31 08:13:23

ansaurus

tags:

views:

answers:

How Can I SELECT Similar Rows in Two Different Tables in MySQL (Is It Possible?)

School

School

related questions