views: 48
answers: 4

I have a requirement to loop through records in a database table and group items that have similar content. I want to match on a single column in the database, and if there are similar records I want to extract the ID of each row and save it to another table, e.g. if I had 10 similar rows they would all be linked to one "header" record in another table.

Below is some simple Pseudocode to illustrate what I need to do:

For Each record In table
    If there is a similar record in the header table Then
        Link this record to the matching header table record
    Else
        Create a new header record and link this record
    End If
End For

I'm using MSSQL 2008 with Full Text Search, which will provide me with the mechanism I need to pick out similar records. At the moment I'm planning to write the for loop in C# code and do the matching and saving in SQL by calling a stored procedure that checks for a matching record.
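Roughly what I have in mind for the stored procedure side is something like this (table, column and procedure names are just placeholders, and it assumes a full-text index already exists on the header table's content column):

CREATE PROCEDURE dbo.FindMatchingHeader
    @Content  NVARCHAR(4000),
    @HeaderId INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    SET @HeaderId = NULL;

    -- rank existing header rows by how well they match the incoming text
    SELECT TOP (1) @HeaderId = h.HeaderId
    FROM dbo.Headers AS h
    INNER JOIN FREETEXTTABLE(dbo.Headers, Content, @Content) AS ft
        ON h.HeaderId = ft.[KEY]
    ORDER BY ft.[RANK] DESC;
END

The C# loop would then call this for each record and either link to the returned @HeaderId or create a new header row when it comes back NULL.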

Something is telling me this should all be done in a single stored procedure (and something else tells me to keep the logic in the code!).

Is there a neater way of doing this in SQL?

A: 

Here is an example; try adapting it to your needs.

SELECT email,
       COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING COUNT(email) > 1
Misnomer
@Misnomer: thanks for the example; however, it would only match exact duplicates. I need to check for similar records that may not be exactly the same.
BradB
You could add another condition to the HAVING clause, such as `OR email LIKE '%similar%'`, to check for similar items.
Misnomer
@Misnomer: I plan to use FTS as the LIKE operator isn't sophisticated enough for my requirements. Have you ever used an FTS JOIN in the style of your example? Do-able?
BradB
A: 

You may want to look into the MERGE statement that is new in SQL Server 2008. See, for example: Inserting, Updating, and Deleting Data by Using MERGE.
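As a rough sketch of how that could look here (table and column names are placeholders), a single MERGE can create any missing header rows:

-- exact-match key shown for simplicity; Headers, Records and MatchKey are placeholders
MERGE dbo.Headers AS target
USING (SELECT DISTINCT MatchKey FROM dbo.Records) AS source
    ON target.MatchKey = source.MatchKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT (MatchKey) VALUES (source.MatchKey);  -- MERGE must be terminated with a semicolon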

Joe Stefanelli
A: 

You can write a stored procedure and schedule it via a maintenance plan, or you can use embedded C# code on SQL Server (SQL CLR), which lets you build better matching algorithms more easily on the database side. Alternatively, you could write a Windows service for a batch-processing job that runs regularly.
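For instance, hooking an embedded C# similarity function into the database might look roughly like this (the assembly path, class and method names are hypothetical; CLR integration has to be enabled first):

EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
CREATE ASSEMBLY SimilarityLib FROM 'C:\libs\SimilarityLib.dll';   -- hypothetical path
GO
CREATE FUNCTION dbo.IsSimilar (@a NVARCHAR(4000), @b NVARCHAR(4000))
RETURNS BIT
AS EXTERNAL NAME SimilarityLib.[SimilarityLib.Functions].IsSimilar;
GO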

sirmak
A: 

Databases are really good at dealing with distinct pieces of information. They are not so good at dealing with quasi-distinct information.

With that said, see if the SOUNDEX function works (well enough) for grouping similar inputs.
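A minimal sketch of that idea (table and column names are placeholders):

SELECT SOUNDEX(Content) AS SoundexCode,
       COUNT(*)         AS NumOccurrences
FROM dbo.Records
GROUP BY SOUNDEX(Content)
HAVING COUNT(*) > 1

Bear in mind SOUNDEX only really compares the leading sounds of a string, so it tends to suit short values like names rather than longer text.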

And, for the love of god, don't use anything like this in a production environment.

JoshRoss