views:

374

answers:

3

I have an Access 2003 table with ~4000 records which was made from 17 different tables. Roughly half of these records are duplicates. There is no unique identifying column (id, name etc). There is an id column which was auto filled when the tables were combined meaning that the duplicates aren't completely identical (though this column could be removed if it makes things easier).

I have used the Access Find Duplicates Query Wizard which gives me a list of the duplicated records but won't let me delete them (seriously what use is this query if I can't delete them?). I've tried converting the generated query to a remove query but that changes the number of rows that it finds. I'd alter the sql by hand but it's a bit beyond me and is 7 lines long.

Does anyone know a good way of getting rid of the duplicates?

+1  A: 

Use a select with all columns except the ID column:

SELECT DISTINCTROW Column1, Column2, Column3 
INTO MYNEWTABLE
FROM TABLE

You can simply swap the names.

This solution will give you a new table with non duplicates.

Raj More
Maybe I'm doing something wrong but that just duplicates the whole table.
Mr_Chimp
+10  A: 

The reason the find duplicates query won't let you delete the records is because it is basically just an aggregate query, it is counting the number of duplicates it finds and returning the cases where the count is greater than 1.

Consider that if you did make a delete query based on the find duplicates, it would delete all rows that have duplicate values, which is maybe not what you want. You want to delete all but one of the duplicates.

You should try to delete all duplicates of a record apart from one, excluding the ID column in your comparison. I suggest the simplest way to do this is to make a make-table query of all the unique values (Select Distinct Field1, Field2... from MyTable) instead for every field except for the ID field, using the results in a to create a new table of around 2000 records (if half are duplicates).

Then, create an ID column on your new table, use an update query to update this ID to the first matching ID in the original table (you could do this using DLookup, which will return the first EXPRESSION value where CRITERIA is true in DOMAIN).

The DLookup() function returns one value from a single field even if more than one record satisfies the criteria. If no record satisfies the criteria, or if the domain contains no records, DLookup() returns a Null.

Since you are identifying the first matching ID based on all the other fields, which are unique values, the unmatched IDs will belong to duplicates. You will be reversing the PK relation, identifying the first matching key given a set of unique fields. After that, you should set the ID to be PK. Of course this assumes the ID has no inherent meaning, and you don't care about keeping one particular ID for a given duplicated row over any of the IDs belonging to the other duplicated rows. This assumes you care about the data in the ID column so you want to preserve it for all remaining rows, otherwise just ignore the DLookup step and do a Select Distinct on all columns apart from the ID.

Dale Halliwell
Aha! Those first two paragraphs make a lot of sense!I've made a query as follows:SELECT DISTINCT blah1, blah2, blah3, etc INTO newtable FROM allrecords...which has given me 2144 records, which seems about right. Has this not just created a copy of the table without the duplicates? I.e. do I actually need to the DLookup bit?
Mr_Chimp
Yes, it has copied all the rows. You don't need to do the DLookup bit unless you need to keep the IDs for all the rows.
Dale Halliwell
When @Dale Halliwell says "The reason the find duplicates query won't let you delete the records is because it is basically just an aggregate query" he is making a false statement. The find duplicates wizard uses an aggregate query in an IN subquery in the WHERE clause, but the base query should be editable, as long as the recordsource you chose to de-dupe was editable to begin with.
David-W-Fenton
@David W Fenton, FYI @Dale Halliwell is correct, I can not delete directly with the find-duplicates query yet I can delete from the base table. I recently have the same problem as OP but with much larger tables, I could not solve it until today. This method works for me so +1
rkarajan
+1  A: 

The following will preserve original IDs and do it in one step:

DELETE FROM table_with_duplicates 
WHERE table_with_duplicates.id NOT IN 
    (SELECT max(id) 
    FROM table_with_duplicates 
    GROUP BY duplicated_field_1, duplicated_field_2, ...
    )

Now you have the original table with no duplicates and preserved ids. And always remember to backup you data before trying large DELETEs.

avguchenko
I don't think this would work. Each record has a separate id, i.e. the duplicates are identical in every way EXCEPT for the id column. If there was a unique identifying column I'd be finished by now!
Mr_Chimp
in your question you mention that the id is autofilled which i take to mean auto-incremented, so this query will return duplicate grouped records and count ids. it would work indeed. try it!
avguchenko