Suppose I have the following data:

OrderNumber  |  CustomerName  |  CustomerAddress  | CustomerCode
          1  |  Chris         |  1234 Test Drive  |          123
          2  |  Chris         |  1234 Test Drive  |          123

How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that SQL Server data mining is probably the right tool for the job, but I don't have much experience with it.

Thanks in advance.

UPDATE:

By "correlate", I mean in the statistics sense: whenever column a is x, column b will be y. In the above data, the last three columns correlate with each other, and the first column does not.

The input of the operation would be the name of the table, and the output would be something like:

         Column 1     |    Column 2          | Certainty
      CustomerName    |  CustomerAddress     | 100%
      CustomerAddress |  CustomerCode        | 100%
A: 

What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:

select
    a.OrderNumber as FirstOrderNumber,
    b.OrderNumber as SecondOrderNumber
from
    MyTable a
    inner join MyTable b on
        a.CustomerName = b.CustomerName
        and a.CustomerAddress = b.CustomerAddress
        and a.CustomerCode = b.CustomerCode
        -- list each pair once and skip rows matched with themselves
        and a.OrderNumber < b.OrderNumber

This would return you:

FirstOrderNumber  |  SecondOrderNumber
               1  |                  2
Eric
A: 

Correlation is defined on metric spaces, and your values are not metric.

This will give you the fraction (0 to 1) of customers whose customerAddress is not uniquely determined by customerName:

SELECT  AVG(imperfect)
FROM    (
        -- 1.0 when a name maps to more than one address, else 0.0;
        -- decimal literals keep AVG from doing integer arithmetic
        SELECT  customerName, CASE WHEN COUNT(DISTINCT customerAddress) > 1 THEN 1.0 ELSE 0.0 END AS imperfect
        FROM    orders
        GROUP BY
                customerName
        ) q

Substitute other column pairs for customerName and customerAddress in this query to find discrepancies between them.
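
To get closer to the "Column 1 | Column 2 | Certainty" output asked for in the question, here is a minimal sketch for a single column pair. It assumes the table is called orders and holds the two sample rows from the question; the table definition, the column types, and the aliases pair_count, top_count, total_count, and CertaintyPercent are made up for illustration. "Certainty" is taken here to mean the share of rows that agree with the most common address for their name, which is only one possible definition.

```sql
-- Hypothetical sample table mirroring the data in the question.
CREATE TABLE orders (
    OrderNumber     INT PRIMARY KEY,
    CustomerName    VARCHAR(50),
    CustomerAddress VARCHAR(100),
    CustomerCode    INT
);

INSERT INTO orders VALUES (1, 'Chris', '1234 Test Drive', 123);
INSERT INTO orders VALUES (2, 'Chris', '1234 Test Drive', 123);

-- For each name, count the rows per address, then compare the size of the
-- most common address group with the total number of rows for that name.
SELECT  'CustomerName'    AS Column1,
        'CustomerAddress' AS Column2,
        100.0 * SUM(top_count) / SUM(total_count) AS CertaintyPercent
FROM    (
        SELECT  CustomerName,
                MAX(pair_count) AS top_count,
                SUM(pair_count) AS total_count
        FROM    (
                SELECT  CustomerName, CustomerAddress, COUNT(*) AS pair_count
                FROM    orders
                GROUP BY
                        CustomerName, CustomerAddress
                ) pairs
        GROUP BY
                CustomerName
        ) per_name;
```

For the two sample rows this works out to 100 percent. Covering every column pair of a table would mean generating one such query per pair, for example from the catalog views, which is not shown here.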

Quassnoi
A: 

There is a 'functional dependency' test built into the SQL Server Data Profiling Task (an SSIS component that ships with SQL Server 2008). It is described pretty well in this blog post:

http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx

I have played a little bit with accessing the data profiler output via some (under-documented) .NET APIs, and it seems doable. However, since my requirement dealt with the distribution of column values, I ended up going with something much simpler based on the output of DBCC SHOW_STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
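
For reference, a rough sketch of reading column statistics with DBCC SHOW_STATISTICS. The table name dbo.orders and the statistics name IX_orders_CustomerName are placeholders, not taken from my actual setup; use whatever index or statistics object exists on the column you care about.

```sql
-- Hypothetical names: dbo.orders and IX_orders_CustomerName are placeholders.
-- DENSITY_VECTOR reports "All density", which is 1 / (number of distinct values).
DBCC SHOW_STATISTICS ('dbo.orders', IX_orders_CustomerName) WITH DENSITY_VECTOR;

-- HISTOGRAM shows how the values of the leading column are distributed.
DBCC SHOW_STATISTICS ('dbo.orders', IX_orders_CustomerName) WITH HISTOGRAM;
```

This only describes the value distribution of individual columns; it does not test whether one column determines another.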

Paul Harrington
Thanks dude...I knew I had seen something like this before, I just couldn't remember where.
Chris B. Behrens