I have a table of data and there are many duplicate entries from user submissions.

I want to delete all duplicate rows based on the field subscriberEmail, leaving only the original submission.

In other words, I want to search for all duplicate emails, and delete those rows, leaving only the original.

How can I do this without swapping tables? My table contains a unique ID for each row.

+2  A: 

Since you're using the id column as an indicator of which record is 'original':

delete x 
from myTable x
 join myTable z on x.subscriberEmail = z.subscriberEmail
where x.id > z.id

This will leave one record per email address.
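To check the "one record per email address" claim on throwaway data, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up. SQLite does not accept MySQL's multi-table `delete x from ... join` form, so the sketch swaps in a correlated EXISTS that expresses the same rule: delete a row when another row with the same email has a lower id.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption: id is the primary key).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, subscriberEmail TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, subscriberEmail) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
        (5, "b@example.com"),
    ],
)

# SQLite stand-in for the MySQL self-join delete:
# remove every row for which a row with the same email and a lower id exists.
conn.execute(
    """
    DELETE FROM myTable
    WHERE EXISTS (
        SELECT 1
        FROM myTable z
        WHERE z.subscriberEmail = myTable.subscriberEmail
          AND z.id < myTable.id
    )
    """
)

remaining = conn.execute(
    "SELECT id, subscriberEmail FROM myTable ORDER BY id"
).fetchall()
print(remaining)
```

Only the lowest id for each email should survive. On MySQL itself, the join form shown above is the one to run.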

edit to add:

To explain the query above...

The idea here is to join the table against itself. Pretend that you have two copies of the table, each named something different. Then you could compare them to each other and find the lowest id for each email address. You'd then see the duplicate records that were created later on and could delete them. (I was visualizing Excel when thinking about this.)

To compare a table to itself and still be able to identify each side, you use table aliases. x is a table alias. It is assigned in the from clause like so: from <table> <alias>. x can now be used elsewhere in the same query as a shortcut to refer to that table.

delete x starts the query off with our action and target. We're going to perform a query to select records from multiple tables, and we want to delete records that appear in x.

Aliases are used to refer to both 'instances' of the table. from myTable x join myTable z on x.subscriberEmail = z.subscriberEmail bumps the table up against itself where the emails match. Without the where clause that follows, every record would be selected as it could be joined up against itself.

The where clause limits the records that are selected. where x.id > z.id allows the 'instance' aliased x to contain only the records that match emails but have a higher id value. The data that you really want in the table, unique email addresses (with the lowest id), will not be part of x and will not be deleted. The only records in x will be duplicate records (email addresses) that have a higher id than the original record for that email address.
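The effect of the where clause can be previewed by running the same self-join as a select instead of a delete. This is a rough sketch with made-up rows, using Python's sqlite3 module as a stand-in for MySQL (the join and the comparison read the same in both). Without the comparison every id shows up on the x side, because each row matches itself; with it, only the later duplicates remain.

```python
import sqlite3

# Made-up sample data; the table and column names follow the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, subscriberEmail TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, subscriberEmail) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
        (5, "b@example.com"),
    ],
)

# Self-join only: every row pairs with itself, so every id appears in x.
all_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT x.id
        FROM myTable x
        JOIN myTable z ON x.subscriberEmail = z.subscriberEmail
        ORDER BY x.id
        """
    )
]

# Self-join plus the id comparison: x keeps only rows that have an earlier twin.
dup_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT x.id
        FROM myTable x
        JOIN myTable z ON x.subscriberEmail = z.subscriberEmail
        WHERE x.id > z.id
        ORDER BY x.id
        """
    )
]

print(all_ids)
print(dup_ids)
```

Running a select like this first is a cheap way to see exactly which ids a delete would remove.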

The join and where clauses could be combined in this case:

delete x 
  from myTable x 
  join myTable z
    on x.subscriberEmail = z.subscriberEmail
      and x.id > z.id

For preventing duplicates, consider making the subscriberEmail column a UNIQUE indexed column.
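A small sketch of that idea, again using Python's sqlite3 module as a stand-in. The index name idx_subscriber_email is made up, and SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE. In both databases the ignore form only skips a row when it would violate a unique key, so it has nothing to ignore until such a key exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, subscriberEmail TEXT)")

# The unique index is what makes a repeated email a constraint violation.
# idx_subscriber_email is a hypothetical name chosen for this example.
conn.execute(
    "CREATE UNIQUE INDEX idx_subscriber_email ON myTable (subscriberEmail)"
)

# The third submission repeats the first email and is silently skipped.
for email in ["a@example.com", "b@example.com", "a@example.com"]:
    conn.execute(
        "INSERT OR IGNORE INTO myTable (subscriberEmail) VALUES (?)",
        (email,),
    )

emails = [
    row[0]
    for row in conn.execute("SELECT subscriberEmail FROM myTable ORDER BY id")
]
print(emails)
```

The existing duplicates have to be deleted before the unique index can be created, since the index cannot be built over rows that already collide.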

Fosco
Hey, I don't understand what X is in this, and where the date is input. I could use ID, since there is a primary ID set for each row, or the date. ID sounds easier.
RB. J
Hey, it worked!!! I changed "createdOn" to "id" and presto! haha thanks. Here's the code I used in PHP: mysql_query("delete x from my_Table x join my_Table z on x.subscriberEmail = z.subscriberEmail where x.id > z.id") or die (mysql_error());
RB. J
Can you explain how to read this code in English? It helps to be able to logically read something to understand the syntax. Also, how do I prevent duplicates in the first place? I am already using INSERT IGNORE but it does not ignore them.
RB. J
@RB I updated the answer a few days ago, can you let me know if it helped and accept the answer? Or let me know what else you need?
Fosco
A: 

If you have a unique id for each row, you can try something like this. Don't ask me why exactly you need the second select statement; MySQL won't let me execute it otherwise. Also, group by whatever columns make your results unique.

delete from my_table where id in (
  select id from (
    select id from my_table a group by subscriberEmail having count(*) > 1
  ) b
);
Brad Goss
I believe this would delete all of them, not just the extras.
Fosco
No it wouldn't. The group by will group like rows by subscriberEmail. So a subscriber with more than 1 email will be selected (having count(*) > 1). At this point you pretty much have a distinct record set of subscribers with more than 1 email. Take the ID from this record set and delete it. I tried it and it works like a charm.
Brad Goss
Running this code produces error "#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 2" - Code below
RB. J
mysql_query("DELETE from my_table where id in ( select id from ( select id from my_table a group by subscriberEmail having count(*) > 1 ) b") or die (mysql_error());
RB. J
Did you replace my_table with the name of your table? Is the primary key on your table actually called id? Tune it so it works with your database; I just wrote it to be generic.
Brad Goss
@Brad Ah yes you're right. I see that with the grouping you would only get one, but which one? He wants to keep the record with the lowest id for an email address.
Fosco
@Fosco oh I didn't notice that. I don't think you can safely say it'll select the highest id. Could probably alter it with an order by.
Brad Goss
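One possible way to settle the "which one?" question in the grouped approach is to pick the survivor explicitly with MIN(id) and delete every other row. This is a sketch on made-up rows using Python's sqlite3 module, not the query from the answer. On MySQL the inner select would most likely still need the extra derived-table wrapper, because MySQL refuses to select from the table it is deleting from in a plain subquery.

```python
import sqlite3

# Made-up sample data; one email appears three times to show all extras go.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, subscriberEmail TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, subscriberEmail) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
        (5, "b@example.com"),
    ],
)

# Keep the lowest id per email, delete everything else.
conn.execute(
    """
    DELETE FROM my_table
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM my_table
        GROUP BY subscriberEmail
    )
    """
)

remaining = conn.execute(
    "SELECT id, subscriberEmail FROM my_table ORDER BY id"
).fetchall()
print(remaining)
```

Because the kept id is named by MIN(id), the result does not depend on which row the grouping happens to return, and an email with three or more copies is cleaned up in one pass.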