I would like to find all duplicate records by name in a customer table using MySQL, including those that do not match exactly.

I know I can use the query

SELECT id, name FROM customer GROUP BY name HAVING count(*) > 1;

to find all rows that match exactly, but I want to find all duplicate rows matching with a LIKE clause. For instance, there might be a customer with the name "Mark's Widgets" and another with the name "Mark's Widgets Inc." I would like my query to find these as duplicates. So something along the lines of

SELECT id, name AS name1 ... WHERE name1 LIKE CONCAT("%", name2, "%") ...

I know that's completely incorrect, but that's the idea. Here is the table schema:

mysql> describe customer;
+-----------------------------+--------------+------+-----+------------+----------------+
| Field                       | Type         | Null | Key | Default    | Extra          |
+-----------------------------+--------------+------+-----+------------+----------------+
| id                          | int(11)      | NO   | PRI | NULL       | auto_increment |
| name                        | varchar(140) | NO   |     | NULL       |                |
 ...

EDIT: To clarify, I want to find all duplicates, not just duplicates of one specific customer name.

A: 
SELECT * FROM customer WHERE name LIKE "%Mark's Widgets%";

http://www.mysqltutorial.org/sql-like-mysql.aspx should also help with the LIKE command.

I'm not sure why you need the CONCAT part, though, so this might be too simple.

foxed
Maybe I wasn't clear enough. I want to find all duplicates, not just duplicates of one specific customer name. To the same effect as the first query in the example, but using LIKE.
markb
+2  A: 

It's quite possible to do this, but before you even begin you need to define your rules for what is a match and what is not. Without that you can't go anywhere.

You could, for example, ignore the first and last 3 characters of the name and match on the middle characters, or you could choose more complex logic. There is no magic method of achieving what you want, though; you will have to code the logic. Whatever you choose, it needs to be defined before you start and before we can really help much.

I don't have MySQL here, so excuse any syntax errors (if anything, it's T-SQL syntax), but I'm thinking of a self join:

SELECT
    t1.ID
FROM MyTable t1
LEFT OUTER JOIN MyTable t2
ON t1.name LIKE '%' + t2.name + '%'
GROUP BY t1.ID
HAVING count(*) > 1
Paul Creasey
I think a good start is one name being a substring of another. The kind of matching I was looking for was name1 LIKE %name2%.
markb
@markb, OK, I edited in a possible solution.
Paul Creasey
Here is the MySQL syntax: SELECT t1.ID, t1.name FROM customer t1 LEFT OUTER JOIN customer t2 ON t1.name LIKE CONCAT('%', t2.name, '%') group by t1.ID HAVING count(*) > 1;
markb
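A note on the GROUP BY / HAVING queries above: every name contains itself, so each row always joins to itself, and count(*) > 1 is true only for a row whose name contains some other row's name. With the example names that returns "Mark's Widgets Inc." but not "Mark's Widgets". Here is a minimal sketch that returns both sides, assuming the customer table from the question and that plain substring containment is the matching rule (the aliases c and o are only illustrative):

```sql
-- Return every customer whose name contains, or is contained in,
-- the name of at least one OTHER customer.
SELECT c.id, c.name
FROM customer AS c
WHERE EXISTS (
    SELECT 1
    FROM customer AS o
    WHERE o.id <> c.id                                  -- skip the row itself
      AND (   c.name LIKE CONCAT('%', o.name, '%')      -- c contains o
           OR o.name LIKE CONCAT('%', c.name, '%'))     -- o contains c
)
ORDER BY c.name;
```

Treat this as a starting point rather than a tuned solution: any % or _ inside a name acts as a LIKE wildcard, an empty name would match every row, and case sensitivity depends on the column's collation.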
A: 

I think this will work, but in my experience, join conditions that call functions take a very long time to process, particularly in combination with the LIKE operator. Still, it's marginally better than a cross join.

SELECT
    cust1.id,
    cust1.name
FROM
    customer AS cust1
    INNER JOIN customer AS cust2
        ON cust1.name LIKE CONCAT('%', cust2.name, '%')
GROUP BY
    cust1.id,
    cust1.name
HAVING
    count(*) > 1
Uldeim
A: 

How about this? You can replace a.name = b.name with your LIKE condition if that makes a difference.

Select a.id, b.id from customer a, customer b where a.name = b.name and a.id != b.id;
Joshua
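Following that suggestion, here is a sketch with the equality swapped for the LIKE condition from the question, assuming the same customer table (the column aliases are illustrative). It lists each match as a pair, so you can see which record matched which:

```sql
-- One output row per (containing name, contained name) pair.
SELECT a.id   AS id,
       a.name AS name,
       b.id   AS contained_id,
       b.name AS contained_name
FROM customer AS a
INNER JOIN customer AS b
    ON a.id <> b.id                                 -- never pair a row with itself
   AND a.name LIKE CONCAT('%', b.name, '%')         -- a's name contains b's name
ORDER BY a.name, b.name;
```

Two rows with identical names show up twice, once in each direction. The leading % prevents an index on name from being used, so this compares every pair of rows and will be slow on a large table.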