In SQL Server 2005, I have a table of input coming in of successful sales, and a variety of tables with information on known customers, and their details. For each row of sales, I need to match 0 or 1 known customers.

We have the following information coming in from the sales table:
ServiceId, Address, ZipCode, EmailAddress, HomePhone, FirstName, LastName

The customers information includes all of this, as well as a 'LastTransaction' date.

Any of these fields can map back to 0 or more customers. We count a match as being any time that a ServiceId, Address+ZipCode, EmailAddress, or HomePhone in the sales table exactly matches a customer.

The problem is that we have information on many customers, sometimes multiple in the same household. This means that we might have John Doe, Jane Doe, Jim Doe, and Bob Doe in the same house. They would all match on Address+ZipCode and HomePhone, and possibly more than one of them would match on ServiceId as well.

I need some way to elegantly keep track of, in a transaction, the 'best' match of a customer. If one matches 6 fields, and the others only match 5, that customer should be kept as a match to that record. In the case of multiple matching 5, and none matching more, the most recent LastTransaction date should be kept.

Any ideas would be quite appreciated.

Update: To be a little more clear, I am looking for a good way to verify the number of exact matches in the row of data, and choose which rows to associate based on that information. If the last name is 'Doe', it must exactly match the customer last name, to count as a matching parameter, rather than be a very close match.

A: 

I would probably create a stored function for that (in Oracle) and order by the highest match count:

SELECT * FROM (
 SELECT c.*, MATCH_CUSTOMER( c.Id, par1, par2, par3 ) matches FROM Customer c
) WHERE matches > 0 ORDER BY matches DESC

The function match_customer returns the number of matches based on the input parameters. I suspect it will be slow, since this query always scans the complete customer table.
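
To make the counting concrete, here is a minimal sketch of what a function like match_customer might compute, written in Python rather than Oracle PL/SQL so it can be run on its own. The field names come from the question; the name count_matches, the sample rows, and the choice to treat missing values as non-matching are assumptions made for illustration.

```python
# Hypothetical sketch: count how many matching criteria a customer shares
# with a sale. Address and ZipCode count as one combined criterion.
CRITERIA = [
    ("ServiceId",),
    ("Address", "ZipCode"),
    ("EmailAddress",),
    ("HomePhone",),
]

def count_matches(sale, customer):
    """Return the number of criteria on which sale and customer agree exactly."""
    score = 0
    for fields in CRITERIA:
        # Assumption: None stands in for SQL NULL and never counts as a match.
        if all(sale.get(f) is not None and sale.get(f) == customer.get(f)
               for f in fields):
            score += 1
    return score

sale = {"ServiceId": "S1", "Address": "1 Main St", "ZipCode": "12345",
        "EmailAddress": "doe@example.com", "HomePhone": "555-0100"}
customers = [
    {"Id": 1, "ServiceId": "S1", "Address": "1 Main St", "ZipCode": "12345",
     "EmailAddress": "john@example.com", "HomePhone": "555-0100"},
    {"Id": 2, "ServiceId": "S9", "Address": "1 Main St", "ZipCode": "12345",
     "EmailAddress": "jane@example.com", "HomePhone": "555-0100"},
    {"Id": 3, "ServiceId": "S7", "Address": "9 Elm St", "ZipCode": "99999",
     "EmailAddress": "bob@example.com", "HomePhone": "555-0199"},
]

# Rough equivalent of "WHERE matches > 0 ORDER BY matches DESC".
scored = [(count_matches(sale, c), c["Id"]) for c in customers]
ranked = sorted((s for s in scored if s[0] > 0), reverse=True)
```

Here ranked holds (score, Id) pairs with the strongest match first, and customer 3 is dropped because it matches nothing.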

Janco
A: 

For close matches you can also look at a number of string similarity algorithms.

For example, in Oracle there is the UTL_MATCH.JARO_WINKLER_SIMILARITY function:
http://www.psoug.org/reference/utl%5Fmatch.html

Joeri Sebrechts
A: 

There is also the Levenshtein distance algorithm.
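
For reference, a minimal Python sketch of the standard dynamic-programming form of Levenshtein distance follows. It only matters if close matches are wanted; the question itself asks for exact matches. The function name is just illustrative.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]
```

A distance of 0 means the strings are identical, so an exact-match rule corresponds to accepting only a distance of 0.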

John MacIntyre
+1  A: 

Here's a fairly ugly way to do this, using SQL Server code. Assumptions:
- Column CustomerId exists in the Customer table, to uniquely identify customers.
- Only exact matches are supported (as implied by the question).

SELECT top 1 CustomerId, LastTransaction, count(*) HowMany
 from (select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.ServiceId = sa.ServiceId
       union all select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.EmailAddress = sa.EmailAddress
       union all select Customerid, LastTransaction
        from Sales sa
         inner join Customers cu
          on cu.Address = sa.Address
           and cu.ZipCode = sa.ZipCode
       union all [etcetera -- repeat for each possible link]
      ) xx
 group by CustomerId, LastTransaction
 order by count(*) desc, LastTransaction desc

I dislike using "top 1", but it is quicker to write. (The alternative is to use ranking functions, and that would require either another subquery level or implementing it as a CTE.) Of course, if your tables are large this would fly like a cow unless you had indexes on all your columns.

Philip Kelley
+3  A: 

for SQL Server 2005 and up try:

;WITH SalesScore AS (
SELECT
    s.PK_ID as S_PK
        ,c.PK_ID AS c_PK
        ,CASE 
             WHEN c.PK_ID IS NULL THEN 0
             ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
                  +CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
                  +CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
                  +CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
         END AS Score
    FROM Sales s
        LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
                                       OR (s.Address=c.Address AND s.Zip=c.Zip)
                                       OR s.EmailAddress=c.EmailAddress
                                       OR s.HomePhone=c.HomePhone 
)
SELECT 
    s.*,c.*
    FROM (SELECT
              S_PK,MAX(Score) AS Score
              FROM SalesScore 
              GROUP BY S_PK
         ) dt
        INNER JOIN Sales          s ON dt.s_PK=s.PK_ID 
        INNER JOIN SalesScore    ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
        LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID

EDIT: I hate to write so much actual code when there was no schema given, because I can't actually run this and be sure it works. However, to answer the question of how to handle ties using the last transaction date, here is a newer version of the above code:

;WITH SalesScore AS (
SELECT
    s.PK_ID as S_PK
        ,c.PK_ID AS c_PK
        ,CASE 
             WHEN c.PK_ID IS NULL THEN 0
             ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
                  +CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
                  +CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
                  +CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
         END AS Score
    FROM Sales s
        LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
                                       OR (s.Address=c.Address AND s.Zip=c.Zip)
                                       OR s.EmailAddress=c.EmailAddress
                                       OR s.HomePhone=c.HomePhone 
)
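SELECT
    *
    FROM (SELECT 
              -- s.* and c.* share column names (PK_ID, ServiceId, ...), which a derived table rejects, so name the columns explicitly
              s.PK_ID AS S_PK,c.PK_ID AS C_PK,ss.Score,c.LastTransaction
              ,row_number() over(partition by s.PK_ID order by c.LastTransaction DESC) AS RankValue
              FROM (SELECT
                        S_PK,MAX(Score) AS Score
                        FROM SalesScore 
                        GROUP BY S_PK
                   ) dt
                  INNER JOIN Sales          s ON dt.S_PK=s.PK_ID 
                  INNER JOIN SalesScore    ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score  -- join on the sale key, not only the score
                  LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
         ) dt2
    WHERE dt2.RankValue=1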
KM
Thank you, I'm looking into whether that will work for me, but that so far seems to be what I'm looking for!
Brisbe42
You can customize this by adding as many "cases" to your scoring logic as you need and weighting them differently (make some add more or less than the others).
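
As a rough illustration of the weighting idea, here is a small Python sketch (not the SQL above). The weights are invented for the example, and weighted_score and the sample rows are hypothetical names; only the field names come from the question.

```python
# Hypothetical weights: how much each exact match contributes to the score.
WEIGHTS = {
    ("ServiceId",): 3,
    ("EmailAddress",): 2,
    ("Address", "ZipCode"): 1,
    ("HomePhone",): 1,
}

def weighted_score(sale, customer, weights=WEIGHTS):
    """Sum the weights of every criterion on which both rows agree exactly."""
    total = 0
    for fields, weight in weights.items():
        # Assumption: None stands in for SQL NULL and never counts as a match.
        if all(sale.get(f) is not None and sale.get(f) == customer.get(f)
               for f in fields):
            total += weight
    return total

sale = {"ServiceId": "S1", "Address": "1 Main St", "ZipCode": "12345",
        "EmailAddress": "jd@example.com", "HomePhone": "555-0100"}
account_holder = {"ServiceId": "S1", "Address": "7 Oak Ave", "ZipCode": "54321",
                  "EmailAddress": "jd@example.com", "HomePhone": "555-0111"}
housemate = {"ServiceId": "S2", "Address": "1 Main St", "ZipCode": "12345",
             "EmailAddress": "other@example.com", "HomePhone": "555-0100"}
```

Both customers match on two criteria, but with these weights the account holder outscores the housemate, which is the kind of distinction a flat count cannot make.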
KM
This is the best solution posted to-date. How would you factor in LastTransaction, to resolve ties?
Philip Kelley
@Philip Kelley, to handle the ties, use a row_number()... AS x and then WHERE x=1; see my edit.
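
The rule being expressed (highest score wins, latest LastTransaction breaks ties, keep one row per sale) can be sketched in Python as follows. This is only an illustration of the ordering, not a translation of the query; best_match and the tuple layout are assumptions.

```python
from datetime import date

def best_match(candidates):
    """Pick the highest-scoring candidate; break ties by latest LastTransaction.

    candidates: list of (customer_id, score, last_transaction) tuples for one sale.
    Returns None when no candidate scored above zero.
    """
    matched = [c for c in candidates if c[1] > 0]
    if not matched:
        return None
    # Same ordering as ORDER BY Score DESC, LastTransaction DESC, keeping row 1.
    return max(matched, key=lambda c: (c[1], c[2]))

candidates = [
    (101, 3, date(2009, 1, 15)),
    (102, 3, date(2009, 6, 30)),
    (103, 2, date(2009, 8, 1)),
]
winner = best_match(candidates)
```

Customers 101 and 102 tie on score, so 102 wins on the more recent transaction date even though 103 transacted later still.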
KM
Yep, that's what I'd do. I was hoping you might have some way of avoiding that extra subquery to "wrap" the ranking function.
Philip Kelley
+1  A: 

Frankly I would be wary of doing this at all as you do not have a unique identifier in your data.

John Smith lives with his son John Smith and they both use the same email address and home phone. These are two people, but you would match them as one. We run into this all the time with our data and have no solution for automated matching because of it. We identify possible dups and actually physically call to find out if they are dups.

HLGEM
Thank you for the input. We do have some protections that aren't noted above, to try and prevent this--many of them dealing with situations where a last name and address are the same (and grouping the customers into a 'household' instead, for our purposes). I was mostly using the first name/last name/address example to try and simplify the intended logic for a solution, rather than give a complete idea of the database's structure.
Brisbe42
I once wrote something similar for matching names, where we made use of a thesaurus for other names (e.g. Bob<->Robert) and "sounds like" queries for matching things like Mohamed -> Muhamed, Muhamad, etc.
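
A name thesaurus of that kind could be sketched in Python roughly like this. The table contents and function names are made up for illustration, a real list would be far larger, and the "sounds like" part is not covered here.

```python
# Hypothetical nickname table mapping variants to one canonical form.
CANONICAL = {
    "bob": "robert",
    "rob": "robert",
    "bill": "william",
    "will": "william",
}

def canonical_first_name(name):
    """Lower-case the name and map known nicknames to their canonical form."""
    key = name.strip().lower()
    return CANONICAL.get(key, key)

def same_first_name(a, b):
    """True when both names reduce to the same canonical first name."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this table, same_first_name("Bob", "Robert") is true while same_first_name("Bob", "Bill") is not.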
pjp