I've already looked at the question http://stackoverflow.com/questions/633860/deleting-duplicate-records-using-a-temporary-table, but it doesn't go far enough to help with my problem:
I have a table of approximately 200,000 address locations hosted on SQL Server 2000. The table is full of duplicate data caused by invalid input from various parties over the years. I need to output a list of the duplicate records so I can begin the long process of cleaning them up.
So consider the following table structure:
Create Table Company(
    CompanyId NVarChar(10) Not Null Constraint PK_Locations Primary Key,
    CompanyName NVarChar(30),
    CompanyAddress NVarChar(30),
    CompanyCity NVarChar(30),
    CompanyState Char(2),
    CompanyZip NVarChar(10),
    DateCreated DateTime,
    LastModified DateTime,
    LastModifiedUser NVarChar(64)
)
For the first pass I'm not going to worry about typos and spelling variations. Those will be a bigger nightmare down the road, and I don't yet have the first clue how to solve them.
So for this part, a record is considered a duplicate when it matches at least one other record on the following conditions:
(CompanyName Or CompanyAddress) And CompanyCity And CompanyState
Zip is excluded because so many of the locations are missing zip/postal codes, and so many others have them entered incorrectly, that including it would make the report far less accurate.
I realize that there may legitimately be multiple locations for a company within a single city/state [for instance McDonald's, just off the top of my head], and there may legitimately be multiple companies at a single address within a city and state [for instance inside a shopping mall or office building], but for now we will consider that these at least warrant some level of human attention and will include them in the report.
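To make the rule concrete, here is my best guess at how it would translate into a query. This is an untested sketch, not something I have run against the real data: the aliases c and d are just placeholders, it sticks to syntax I believe SQL Server 2000 supports (no CTEs or ranking functions), and I have no idea whether a correlated subquery like this is a sensible approach on 200,000 rows.

```sql
-- Sketch: list every row that has at least one *other* row with the same
-- city and state AND (the same name OR the same address).
-- Aliases c (candidate) and d (possible duplicate) are illustrative only.
SELECT c.CompanyId,
       c.CompanyName,
       c.CompanyAddress,
       c.CompanyCity,
       c.CompanyState
FROM Company c
WHERE EXISTS (
    SELECT 1
    FROM Company d
    WHERE d.CompanyId <> c.CompanyId          -- never match a row to itself
      AND d.CompanyCity = c.CompanyCity
      AND d.CompanyState = c.CompanyState
      AND (d.CompanyName = c.CompanyName
           OR d.CompanyAddress = c.CompanyAddress)
)
-- Sorting keeps likely duplicates next to each other for manual review.
ORDER BY c.CompanyState, c.CompanyCity, c.CompanyName, c.CompanyAddress
```

One thing I am unsure about: with the default ANSI_NULLS behaviour, a row whose city or state is NULL will never satisfy the equality tests, so it would silently drop out of the report.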
Matches on single fields are a piece of cake, but I'm coming unstuck when I get to multiple fields, especially when some are conditional.
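For reference, the single-field case I'm comfortable with looks roughly like the sketch below. The column alias Occurrences is just a name I made up for the count.

```sql
-- Single-column duplicate check: company names that appear more than once.
SELECT CompanyName,
       COUNT(*) AS Occurrences   -- hypothetical alias for the row count
FROM Company
WHERE CompanyName IS NOT NULL    -- ignore rows with no name at all
GROUP BY CompanyName
HAVING COUNT(*) > 1
ORDER BY CompanyName
```

Grouping works here because there is one key. The part I can't see how to express with a plain GROUP BY is the "name OR address" condition, since a row could belong to one group by name and a different group by address.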