views:

995

answers:

3

I've already checked out the question http://stackoverflow.com/questions/633860/deleting-duplicate-records-using-a-temporary-table and it doesn't quite go far enough to assist me with this question:

I have a table of approximately 200,000 address locations hosted on a SQL 2000 Server. This table has a huge problem with duplicate data in the table caused by invalid input from various parties over the years. I need to output a list of duplicate records so I can begin the long process of cleaning them up.

So consider the following table structure:

Table Company(   
  CompanyId NVarChar(10) Not Null Constraint PK_Locations Primary Key,   
  CompanyName NVarChar(30),   
  CompanyAddress NVarChar(30),   
  CompanyCity NVarchar(30), 
  CompanyState Char(2),   
  CompanyZip NVarChar(10),   
  DateCreated DateTime, 
  LastModified DateTime,  
  LastModifiedUser NVarChar(64) 
)

For the first parse I'm not even going to worry about typos and variations of spelling yet which is going to be a greater nightmare down the road that I haven't even got the first clue about solving yet.

So for this part a record is considered to be duplicate when multiple records match on the following conditions:

(CompanyName Or CompanyAddress) And CompanyCity And CompanyState

Zip is excluded because so many of the locations are missing zip/postal codes and so many are entered incorrectly that it just makes for a far less accurate report if I include them.

I realize that there may legitimately be multiple locations for a company within a single city/state [for instance McDonalds, just off the top of my head], and there may legitimately be multple companies at a single address within a city and state [for instance inside a shopping mall or office building], but for now we will consider that these at least warrant some level of human attention and will include them in the report.

Matches on single fields are a piece of cake, but I'm coming unstuck when I get to multiple fields, especially when some are conditional.

+2  A: 
WITH q AS (
          SELECT Company.*,
                 ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyName ORDER BY CompanyID) AS rnName,
                 ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyAddress ORDER BY CompanyID) AS rnAddress
          FROM Company
          )
SELECT    *
WHERE     rnName > 1 OR rnAddress > 1

Note, though, that if your data will look like this:

CompanyID    CompanyName  CompanyAddress 
---------    -----------  --------------
1            McDonalds    Avenue 1
2            McDonalds    Avenue 2
3            Starbucks    Avenue 2

, then both records 2 and 3 will be deleted (which is what you requested but probably not what you wanted)

If you just want to list all rows having duplicates, then issue:

SELECT  *
FROM    Company co
WHERE   EXISTS
        (
        SELECT  1
        FROM    Company cn
        WHERE   cn.CompanyState = co.CompanyState
                AND cn.CompanyCity = co.CompanyCity
                AND cn.CompanyName = co.CompanyName
                AND cn.CompanyID <> co.CompanyID
        )
        OR EXISTS
        (
        SELECT  1
        FROM    Company ca
        WHERE   ca.CompanyState = co.CompanyState
                AND ca.CompanyCity = co.CompanyCity
                AND ca.CompanyAddress = co.CompanyAddress
                AND ca.CompanyID <> co.CompanyID
        )

This will work in SQL Server 2000 too.

Having indexes on (CompanyState, CompanyCity, CompanyName) and (CompanyState, CompanyCity, CompanyAddress) will greatly improve this query.

Quassnoi
I want all the duplicates listed, they're not going to be put straight into a deletion queue, so list all the dups is what I'm after. We have to merge invoices across from duplicates to the master record before all the dups are removed.
BenAlabaster
Should this work in SQL 2000 Server? It gives me the following errors:Msg 156, Level 15, State 1, Line 1Incorrect syntax near the keyword 'WITH'.Msg 195, Level 15, State 10, Line 3'ROW_NUMBER' is not a recognized function name.
BenAlabaster
No, this will not work in SQL Server 2000
Quassnoi
Perfect, your second response proved far more successful :)
BenAlabaster
A: 
SELECT
     C1.CompanyID,
     C2.CompanyID
FROM
     Company C1
INNER JOIN Company C2 ON
     (C2.CompanyName = C1.CompanyName OR C2.CompanyAddress = C1.CompanyAddress) AND
     C2.CompanyCity = C1.CompanyCity AND
     C2.CompanyState = C2.CompanyState AND
     C2.CompanyID > C1.CompanyID

If you have three or more matches then they will appear multiple times in the list. There are various ways to handle that depending on what exactly you want to get back from the query.

I would also strongly suggest that you look into better front-end coding to restrict how addresses are getting into your system as well as user training.

Tom H.
Hehe, yes, I was hired to assist with the better front end coding - I'm on it, but I'm also in the process of cleaning up the 5 years of mayhem going on before my arrival...
BenAlabaster
Okay, any suggestions for getting back only single instances of duplicates? I guess I could parse them with a cursor, but is there a query based way of doing that?
BenAlabaster
What do you define as a single instance and what would you want to see? For example, if you have company A and B that match on name, city, state and company C that matches A on address, city, state how would you want to see that?
Tom H.
That's a good question I suppose... I would consider that if A matches B on name, city and C matches A on address, city and state that each of these records should be displayed only a single time such that all 3 are considered "duplicate".
BenAlabaster
BenAlabaster
A: 

Try something like this...

Select field1, field2, ... etc, count(*)
FROM Company,
GROUP BY field1, field2, ...
HAVING count(*) > 1

This should show you what is duplicated.

Eppz
+1 Yep, this is totally the answer I was going to give.
Neil Barnwell
How will it show duplicates on (CompanyName OR CompanyAddress)?
Quassnoi
But wouldn't account for the *OR* situation... it only accounts if all the fields in the group by match
BenAlabaster
Easy enough to add a where clause to meet the conditions.
Eppz