I'm trying to return duplicate records in a user table where the fields only partially match, and the matching field contents are arbitrary. I'm not sure I'm explaining it well, so here is the query I would run to find duplicate members by a field that is supposed to be unique:

SELECT MAX(id)
FROM members
WHERE 1
GROUP BY some_unique_field
HAVING COUNT(some_unique_field) > 1

I want to apply this same idea to an email field, but unfortunately our email field can contain multiple emails separated by commas. For example, I want a member whose email is set to "[email protected]" to be returned as a duplicate of another member that has "[email protected]","[email protected]" in their field. GROUP BY obviously will not accomplish this as-is.

A: 

Something like this might work for you:

SELECT *
FROM members m1
inner join members m2 on m1.id <> m2.id
    and (
        m1.email = m2.email
        or m1.email like '%,' + m2.email
        or m1.email like m2.email + ',%'
        or m1.email like '%,' + m2.email + ',%'
    )   

It depends on how consistently your email addresses are formatted when there is more than one. You might need to modify the query slightly if, for example, there is always a space after the comma, or if the quotes are actually part of your data. Note that + as string concatenation is SQL Server syntax; in MySQL you would use CONCAT() instead.
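To illustrate the formatting concern, here is a minimal Python sketch of normalizing one email field before comparing addresses. The function name split_emails and the sample addresses are hypothetical, and it assumes the only noise is surrounding whitespace, literal double quotes, and letter case:

```python
def split_emails(field):
    """Split a comma-separated email field into normalized addresses."""
    addresses = []
    for raw in field.split(","):
        # Drop surrounding whitespace and literal double quotes, then lowercase.
        cleaned = raw.strip().strip('"').strip().lower()
        if cleaned:
            addresses.append(cleaned)
    return addresses


# Hypothetical sample value with a space after the comma and quoted entries.
print(split_emails('"a@example.com", "B@example.com"'))
# ['a@example.com', 'b@example.com']
```

Once every field is cleaned the same way, an exact comparison per address is enough and the LIKE patterns no longer need to account for spaces or quotes.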

RedFilter
Thanks for the answer. Unfortunately, the INNER JOIN of our members table is 94 million records and the query takes too long, which is why I was shying away from joins of this nature. I think that if I separated the email addresses out into their own table, as they SHOULD be, I could accomplish what I want more easily.
Mitch Weaver
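The idea in the comment above, moving the addresses into their own table, could look roughly like the following sketch. It uses Python's built-in sqlite3 module with an in-memory database purely for demonstration; the member_emails table, its columns, the index name, and the sample rows are all assumptions, not part of the original schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, email TEXT);
    -- Hypothetical child table: one row per (member, address) pair.
    CREATE TABLE member_emails (member_id INTEGER, email TEXT);
    CREATE INDEX idx_member_emails_email ON member_emails (email);
""")

# Made-up sample data: member 2 shares an address with member 1.
conn.executemany("INSERT INTO members VALUES (?, ?)", [
    (1, "a@example.com"),
    (2, "b@example.com,a@example.com"),
    (3, "c@example.com"),
])

# Split each comma-separated field into one row per address.
for member_id, field in conn.execute("SELECT id, email FROM members").fetchall():
    for address in field.split(","):
        address = address.strip().lower()
        if address:
            conn.execute(
                "INSERT INTO member_emails VALUES (?, ?)", (member_id, address)
            )

# The original GROUP BY / HAVING idea now works per individual address.
duplicates = conn.execute("""
    SELECT email, MAX(member_id), COUNT(DISTINCT member_id)
    FROM member_emails
    GROUP BY email
    HAVING COUNT(DISTINCT member_id) > 1
""").fetchall()

print(duplicates)  # [('a@example.com', 2, 2)]
```

With an index on the address column, the grouping avoids the self-join on members entirely, which is likely the main cost in the LIKE-based approach. Whether the one-time split is practical at 94 million records would need to be measured.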
A: 

This works for me, though it may not do what you want:

SELECT MAX(ID) FROM members WHERE Email like "%someuser%" GROUP BY Email HAVING COUNT(Email) > 1

Nat
This works great as long as you can guarantee your email field contains only one email. Ours may contain multiple addresses separated by commas, and I'm trying to group partial matches, which doesn't appear to be feasible with our schema as it exists now.
Mitch Weaver