I've already looked at the question http://stackoverflow.com/questions/633860/deleting-duplicate-records-using-a-temporary-table, but it doesn't go far enough to help with my problem:
I have a table of approximately 200,000 address locations hosted on SQL Server 2000. The table has a huge problem with duplicate data, caused by invalid input from various parties over the years. I need to output a list of duplicate records so I can begin the long process of cleaning them up.
So consider the following table structure:
Create Table Company (
    CompanyId NVarChar(10) Not Null Constraint PK_Locations Primary Key,
    CompanyName NVarChar(30),
    CompanyAddress NVarChar(30),
    CompanyCity NVarChar(30),
    CompanyState Char(2),
    CompanyZip NVarChar(10),
    DateCreated DateTime,
    LastModified DateTime,
    LastModifiedUser NVarChar(64)
)
For the first pass I'm not going to worry about typos and spelling variations. Those will be a bigger nightmare down the road, and I don't yet have the first clue how to solve them.
So for this part, a record is considered a duplicate when it matches at least one other record on the following condition:
(CompanyName Or CompanyAddress) And CompanyCity And CompanyState
Zip is excluded because so many of the locations are missing zip/postal codes and so many are entered incorrectly that it just makes for a far less accurate report if I include them.
I realize that there may legitimately be multiple locations for a company within a single city/state [for instance McDonalds, just off the top of my head], and there may legitimately be multiple companies at a single address within a city and state [for instance inside a shopping mall or office building], but for now we will consider that these at least warrant some level of human attention and will include them in the report.
Matches on single fields are a piece of cake, but I'm coming unstuck when I get to multiple fields, especially when some are conditional.
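To make the rule concrete, here is a rough, untested sketch of the kind of query the condition above seems to describe, assuming the Company table exactly as defined earlier. It uses a correlated EXISTS subquery and nothing newer, since SQL Server 2000 has no CTEs or window functions. The aliases c and d and the sort order are only illustrative choices.

```sql
-- Sketch only: list every row that has at least one other row with the
-- same city and state AND (the same name OR the same address).
SELECT  c.CompanyId,
        c.CompanyName,
        c.CompanyAddress,
        c.CompanyCity,
        c.CompanyState
FROM    Company AS c
WHERE   EXISTS (
            SELECT 1
            FROM   Company AS d
            WHERE  d.CompanyId    <> c.CompanyId      -- a different record
              AND  d.CompanyCity  =  c.CompanyCity
              AND  d.CompanyState =  c.CompanyState
              AND  (   d.CompanyName    = c.CompanyName
                    OR d.CompanyAddress = c.CompanyAddress)
        )
-- Sorting keeps likely duplicates next to each other in the report.
ORDER BY c.CompanyState, c.CompanyCity, c.CompanyName, c.CompanyAddress
```

Two caveats with this sketch: an = comparison never matches NULL, so rows with a missing city or state would not be reported, and whether case or trailing spaces count as differences depends on the column collation.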