ansaurus

Question

How can I choose the closest match in SQL Server 2005?

Answer 1

A:

I would probably create a stored function for that (in Oracle) and order by the highest match:

SELECT * FROM (
 SELECT c.*, MATCH_CUSTOMER( c.Id, par1, par2, par3 ) matches FROM Customer c
) WHERE matches >0 ORDER BY matches desc

The function match_customer returns the number of matches based on the input parameters. I guess it is probably slow, as this query will always scan the complete Customer table.

Janco 2009-08-27 20:54:30

Answer 2

A:

For close matches you can also look at a number of string similarity algorithms.

For example, in Oracle there is the UTL_MATCH.JARO_WINKLER_SIMILARITY function:
http://www.psoug.org/reference/utl%5Fmatch.html

Joeri Sebrechts 2009-08-27 21:00:10

Answer 3

A:

There is also the Levenshtein distance algorithm.
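
As a rough illustration of what Levenshtein distance computes, here is a minimal sketch in Python. The function names and sample strings are assumptions made up for the example. As far as I know SQL Server 2005 has no built-in for this, so there you would have to port the same logic into a user-defined function or a CLR routine.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def closest(name: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest case-insensitive edit distance to name."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
```

A lower distance means a closer match, so closest("Jon Smith", [...]) simply picks the candidate with the fewest edits. For long strings or big tables this gets expensive, since every candidate has to be compared.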

John MacIntyre 2009-08-27 21:01:31

Answer 4

+1 A:

Here's a fairly ugly way to do this, using SQL Server code. Assumptions:
- Column CustomerId exists in the Customer table, to uniquely identify customers.
- Only exact matches are supported (as implied by the question).

SELECT top 1 CustomerId, LastTransaction, count(*) HowMany
 from (select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.ServiceId = sa.ServiceId
       union all select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.EmailAddress = sa.EmailAddress
       union all select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.Address = sa.Address
           and cu.ZipCode = sa.ZipCode
       union all [etcetera -- repeat for each possible link]
      ) xx
 group by CustomerId, LastTransaction
 order by count(*) desc, LastTransaction desc

I dislike using "top 1", but it is quicker to write. (The alternative is to use ranking functions, and that would require either another subquery level or implementing it as a CTE.) Of course, if your tables are large this would fly like a cow unless you had indexes on all your columns.
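
To try the UNION ALL counting idea on toy data without a SQL Server instance, here is a small sketch using Python's built-in sqlite3 module. The table layouts, the sample rows, and the choice to keep LastTransaction on Customers are all assumptions made for the example. LIMIT 1 stands in for TOP 1 because SQLite has no TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, ServiceId TEXT, EmailAddress TEXT,
                            Address TEXT, ZipCode TEXT, LastTransaction TEXT);
    CREATE TABLE Sales (ServiceId TEXT, EmailAddress TEXT, Address TEXT, ZipCode TEXT);
    -- Hypothetical sample data: customer 2 matches the sale on three links, customer 3 on two.
    INSERT INTO Customers VALUES
        (1, 'S1', 'a@example.com', '1 Main St', '11111', '2009-01-01'),
        (2, 'S2', 'b@example.com', '2 Oak Ave', '22222', '2009-06-01'),
        (3, 'S9', 'b@example.com', '2 Oak Ave', '22222', '2009-08-01');
    INSERT INTO Sales VALUES ('S2', 'b@example.com', '2 Oak Ave', '22222');
""")

# One branch per possible link; each matching link contributes one row per customer.
QUERY = """
SELECT CustomerId, LastTransaction, COUNT(*) AS HowMany
  FROM (SELECT cu.CustomerId, cu.LastTransaction
          FROM Sales sa JOIN Customers cu ON cu.ServiceId = sa.ServiceId
        UNION ALL
        SELECT cu.CustomerId, cu.LastTransaction
          FROM Sales sa JOIN Customers cu ON cu.EmailAddress = sa.EmailAddress
        UNION ALL
        SELECT cu.CustomerId, cu.LastTransaction
          FROM Sales sa JOIN Customers cu
            ON cu.Address = sa.Address AND cu.ZipCode = sa.ZipCode) xx
 GROUP BY CustomerId, LastTransaction
 ORDER BY COUNT(*) DESC, LastTransaction DESC
 LIMIT 1
"""

best = conn.execute(QUERY).fetchone()
print(best)  # (CustomerId, LastTransaction, HowMany) of the closest customer
```

With more than one row in Sales you would also need to carry a sale identifier through each branch and group by it, otherwise the counts from different sales get mixed together.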

Philip Kelley 2009-08-27 21:08:47

Answer 5

+3 A:

For SQL Server 2005 and up, try:

;WITH SalesScore AS (
SELECT
    s.PK_ID as S_PK
        ,c.PK_ID AS c_PK
        ,CASE 
             WHEN c.PK_ID IS NULL THEN 0
             ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
                  +CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
                  +CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
                  +CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
         END AS Score
    FROM Sales s
        LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
                                       OR (s.Address=c.Address AND s.Zip=c.Zip)
                                       OR s.EmailAddress=c.EmailAddress
                                       OR s.HomePhone=c.HomePhone 
)
SELECT 
    s.*,c.*
    FROM (SELECT
              S_PK,MAX(Score) AS Score
              FROM SalesScore 
              GROUP BY S_PK
         ) dt
        INNER JOIN Sales          s ON dt.s_PK=s.PK_ID 
        INNER JOIN SalesScore    ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
        LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID

EDIT: I hate to write so much actual code when there was no schema given, because I can't actually run this and be sure it works. However, to answer the question of how to handle ties using the last transaction date, here is a newer version of the above code:

;WITH SalesScore AS (
SELECT
    s.PK_ID as S_PK
        ,c.PK_ID AS c_PK
        ,CASE 
             WHEN c.PK_ID IS NULL THEN 0
             ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
                  +CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
                  +CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
                  +CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
         END AS Score
    FROM Sales s
        LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
                                       OR (s.Address=c.Address AND s.Zip=c.Zip)
                                       OR s.EmailAddress=c.EmailAddress
                                       OR s.HomePhone=c.HomePhone 
)
SELECT
    *
    FROM (SELECT 
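              -- s.* and c.* share column names such as PK_ID; a derived table needs unique names, so list the columns explicitly in real code
              s.*,c.*,row_number() over(partition by s.PK_ID order by s.PK_ID ASC,c.LastTransaction DESC) AS RankValue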
              FROM (SELECT
                        S_PK,MAX(Score) AS Score
                        FROM SalesScore 
                        GROUP BY S_PK
                   ) dt
                  INNER JOIN Sales          s ON dt.s_PK=s.PK_ID 
                  INNER JOIN SalesScore    ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
                  LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
         ) dt2
    WHERE dt2.RankValue=1

KM 2009-08-27 21:10:06

Thank you, I'm looking into whether that will work for me, but that so far seems to be what I'm looking for!

Brisbe42 2009-08-27 21:12:06

You can customize this by adding as many "cases" as you like to your scoring logic and weighting them differently (add more than 1 for a strong indicator, less for a weak one).
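
The weighting idea, together with the LastTransaction tie-break, can be sketched outside SQL roughly like this. It is only an illustration in Python: the Person class, the field names, and the weight values are hypothetical, and empty values are treated as "no match" to mimic how NULL = NULL is not a match in SQL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    service_id: str
    email: str
    address: str
    zip_code: str
    home_phone: str
    last_transaction: str = ""  # ISO date string, so string order is date order


# Hypothetical weights: a shared ServiceId counts for more than a shared phone.
WEIGHTS = {"service_id": 3, "email": 2, "address_zip": 1, "home_phone": 1}


def same(a: str, b: str) -> bool:
    """Equal and non-empty; empty values never match, like NULLs in SQL."""
    return bool(a) and a == b


def score(sale: Person, customer: Person) -> int:
    """Sum the weights of every link on which the sale and customer agree."""
    total = 0
    if same(sale.service_id, customer.service_id):
        total += WEIGHTS["service_id"]
    if same(sale.email, customer.email):
        total += WEIGHTS["email"]
    if same(sale.address, customer.address) and same(sale.zip_code, customer.zip_code):
        total += WEIGHTS["address_zip"]
    if same(sale.home_phone, customer.home_phone):
        total += WEIGHTS["home_phone"]
    return total


def best_match(sale: Person, customers: list[Person]) -> Optional[Person]:
    """Highest score wins; ties go to the most recent last_transaction."""
    scored = [(score(sale, c), c.last_transaction, c) for c in customers]
    scored = [t for t in scored if t[0] > 0]
    if not scored:
        return None
    return max(scored, key=lambda t: (t[0], t[1]))[2]
```

Sorting on the pair (score, last_transaction) plays the same role as ordering by score and then LastTransaction DESC in the query.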

KM 2009-08-27 21:17:08

This is the best solution posted to date. How would you factor in LastTransaction, to resolve ties?

Philip Kelley 2009-08-28 14:01:19

@Philip Kelley, to handle the ties, use a row_number()... AS x and then WHERE x=1; see my edit...

KM 2009-08-28 14:32:12

Yep, that's what I'd do. I was hoping you might have some way of avoiding that extra subquery to "wrap" the ranking function.

Philip Kelley 2009-08-28 18:36:44

Answer 6

+1 A:

Frankly I would be wary of doing this at all as you do not have a unique identifier in your data.

John Smith lives with his son John Smith and they both use the same email address and home phone. These are two people, but you would match them as one. We run into this all the time with our data and have no solution for automated matching because of it. We identify possible dups and actually physically call and find out if they are dups.

HLGEM 2009-08-27 21:49:06

Thank you for the input. We do have some protections that aren't noted above, to try and prevent this--many of them dealing with situations where a last name and address are the same (and grouping the customers into a 'household' instead, for our purposes). I was mostly using the first name/last name/address example to try and simplify the intended logic for a solution, rather than give a complete idea of the database's structure.

Brisbe42 2009-08-27 21:57:08

I once wrote something similar for matching names, where we made use of a thesaurus for alternative names (e.g. Bob<->Robert) and sounds-like queries for matching things like Mohamed -> Muhamed, Muhamad, etc.
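
A minimal sketch of that combination in Python, assuming a tiny hand-made nickname table and standard American Soundex for the sounds-like part. The NICKNAMES entries and function names are made up for the example. SQL Server also ships SOUNDEX() and DIFFERENCE() functions if you want to stay in the database.

```python
# Hypothetical nickname "thesaurus": maps a short form to its canonical name.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}

# Soundex digit for each consonant; vowels, Y, H and W have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def names_match(a: str, b: str) -> bool:
    """True when two first names are equal via the nickname table or sound alike."""
    a, b = a.strip().lower(), b.strip().lower()
    canon_a, canon_b = NICKNAMES.get(a, a), NICKNAMES.get(b, b)
    return canon_a == canon_b or soundex(canon_a) == soundex(canon_b)
```

Soundex is coarse and tuned for English names, so treat a hit as a candidate for review rather than proof of a match.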

pjp 2009-08-28 15:14:45
