I have two sets of data (locations) in separate tables and I need to compare them to see whether they match. I have a UDF which performs a calculation based upon five values from each table.

How do I perform a SELECT with a JOIN using this UDF?

My UDF is basically defined as follows:

ALTER FUNCTION [dbo].[MatchRanking]
(
        @Latitude       FLOAT
    ,   @Longitude      FLOAT
    ,   @Postcode       VARCHAR(16)
    ,   @CompanyName    VARCHAR(256)
    ,   @TelephoneNumber VARCHAR(32)
    ,   @Latitude2      FLOAT
    ,   @Longitude2     FLOAT
    ,   @Postcode2      VARCHAR(16)
    ,   @CompanyName2   VARCHAR(256)
    ,   @TelephoneNumber2 VARCHAR(32)

)
RETURNS INT
WITH EXECUTE AS CALLER
AS
BEGIN

    DECLARE @RetVal INT
    DECLARE @PostcodeVal INT
    SET @RetVal = 0
    SET @PostcodeVal = 0

    SET @RetVal = @RetVal + dbo.FuzzyLogicStringMatch(@CompanyName, @CompanyName2)

    IF @RetVal = 1 AND dbo.TelephoneNoStringMatch(@TelephoneNumber, @TelephoneNumber2) = 1
        RETURN 5
    ELSE
        IF @RetVal = 1 AND dbo.FuzzyLogicStringMatch(@Postcode, @Postcode2) = 1
            RETURN 5
        ELSE
            IF @RetVal = 1 AND ROUND(@Latitude,4) = ROUND(@Latitude2,4) AND ROUND(@Longitude,4) = ROUND(@Longitude2,4)
                RETURN 5
            ELSE
                IF (@RetVal = 1 AND ROUND(@Latitude,4) = ROUND(@Latitude2,4)) OR (@RetVal = 1 AND ROUND(@Longitude,4) = ROUND(@Longitude2,4))
                    SET @RetVal = 2
                ELSE
                BEGIN                   
                    IF ROUND(@Latitude,4) = ROUND(@Latitude2,4)
                        SET @RetVal = @RetVal + 1
                    IF ROUND(@Longitude,4) = ROUND(@Longitude2,4)
                        SET @RetVal = @RetVal + 1
                    SET @RetVal = @RetVal + dbo.TelephoneNoStringMatch(@TelephoneNumber, @TelephoneNumber2)
                    SET @RetVal = @RetVal + dbo.FuzzyLogicStringMatch(@Postcode, @Postcode2)
                END

    RETURN @RetVal
END

This is the previous code that I am trying to fix:

        SELECT li.LImportId, l.LocationId,
               dbo.MatchRanking(li.Latitude, li.Longitude, li.Postcode, li.[Name], li.TelephoneNumber,
                                l.Latitude, l.Longitude, l.Postcode, l.CompanyName, l.TelephoneNumber
                    ) AS [MatchRanking]
          FROM @LocImport li
          LEFT JOIN [Location] l
            ON li.[Latitude] = l.[Latitude]
            OR li.[Longitude] = l.[Longitude]
            OR li.[Postcode] = l.[Postcode]
            OR li.[Name] = l.[CompanyName]
            OR li.[TelephoneNumber] = l.[TelephoneNumber]
A: 

What was wrong with your original JOIN? That should perform much faster than this function.

This should do it, but I think it will be really slow:

SELECT
...
FROM Table1             t1
    CROSS JOIN Table2   t2
WHERE dbo.MatchRanking(t1.Latitude ,..,..,t2.Latitude ,..)=1 --"1" or whatever return value is a match
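Applied to the tables from the question, that might look like the following sketch. CROSS APPLY avoids calling the UDF twice per row; the threshold of 2 is an assumption, so substitute whatever MatchRanking score you treat as a match:

```sql
SELECT li.LImportId, l.LocationId, mr.MatchRanking
  FROM @LocImport li
  CROSS JOIN [Location] l
  CROSS APPLY (
      SELECT dbo.MatchRanking(li.Latitude, li.Longitude, li.Postcode, li.[Name], li.TelephoneNumber,
                              l.Latitude, l.Longitude, l.Postcode, l.CompanyName, l.TelephoneNumber) AS MatchRanking
  ) mr
 WHERE mr.MatchRanking >= 2  -- assumed threshold; tune to taste
```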
KM
Yes, you are correct; as a solution it has potential, but it is too slow to be viable. I'm comparing table A with X rows against table B, which is already at 55k rows and will increase.
tigermain
A: 

Personally, I would do this kind of fuzzy data matching in several passes.

- Create an xref table which contains the keys for the records that match.

- Get all records that match exactly and insert their keys into the xref table.

- "Fuzzify" your criteria and search again, but only in those records that do not already have a match in the xref.

- Rinse and repeat, expanding and/or fuzzifying your criteria, until the matches you get are garbage.
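The passes above might be sketched like this, using the tables from the question (the xref table name is illustrative):

```sql
-- Holds the keys of pairs already matched by an earlier, stricter pass.
CREATE TABLE LocationXref (LImportId INT, LocationId INT);

-- Pass 1: exact matches only.
INSERT INTO LocationXref (LImportId, LocationId)
SELECT li.LImportId, l.LocationId
  FROM @LocImport li
  JOIN [Location] l
    ON li.Postcode = l.Postcode
   AND li.[Name] = l.CompanyName
   AND li.TelephoneNumber = l.TelephoneNumber;

-- Pass 2: fuzzier criteria, but only for rows not yet matched.
INSERT INTO LocationXref (LImportId, LocationId)
SELECT li.LImportId, l.LocationId
  FROM @LocImport li
  JOIN [Location] l
    ON li.Postcode = l.Postcode
   AND dbo.FuzzyLogicStringMatch(li.[Name], l.CompanyName) = 1
 WHERE NOT EXISTS (SELECT 1 FROM LocationXref x WHERE x.LImportId = li.LImportId);
```

Each pass works on a shrinking remainder, so the expensive fuzzy functions run against far fewer rows than a full cross join would.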

Jeremy
I've updated the question to explain; it's not as simple as what I think your solution would provide.
tigermain
A: 

You will have to cross join the two tables (m and n yields m x n compares) first and compare to find matches - there is really no other simple way.

However, with some understanding of the data, its interpretation, and your goals, anything that lets you filter the set to eliminate items from the cross join fairly easily would help, especially if there's anything that has to be an exact match, or any way to partition the data so that items in different partitions are never compared (e.g. comparing a US and a European location would always give match rank 0).

I would say that you could use the Lat and Long to eliminate completely if they differ by a certain amount, but it looks like they are used to improve the match ranking, not to negatively eliminate items from being match ranked.
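If lat/long could be used to eliminate pairs outright, a pre-filter in the join would shrink the cross join considerably. A sketch, where the 0.1-degree tolerance is an assumed value:

```sql
SELECT li.LImportId, l.LocationId
  FROM @LocImport li
  CROSS JOIN [Location] l
 WHERE ABS(li.Latitude  - l.Latitude)  < 0.1  -- assumed tolerance; widen or narrow as needed
   AND ABS(li.Longitude - l.Longitude) < 0.1
```

Simple arithmetic predicates like these can be evaluated (and potentially indexed) by the optimizer, unlike an opaque scalar UDF.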

And scalar functions (your FuzzyMatches) called repeatedly (like in a multi-million row cross join) are tremendously expensive.

It looks to me like you could extract the first match and the inner ELSE to take place in your cross join (inline if possible, not as a UDF) so that they can be somewhat optimized by the query optimizer in conjunction with the cross join, instead of inside a black box called m x n times.

Another possibility is to pre-extract only the distinct Phone number pairs, post code pairs etc.

SELECT Postcode1, Postcode2, dbo.FuzzyLogicStringMatch(Postcode1, Postcode2) AS MatchRank
FROM (
    SELECT DISTINCT Postcode AS Postcode1
    FROM Table1
) AS Postcodes1
CROSS JOIN
(
    SELECT DISTINCT Postcode AS Postcode2
    FROM Table2
) AS Postcodes2

If your function is symmetric, you can further reduce this space over which to call the UDF with some extra work (it's easier if the table is a self-join, you just use an upper right triangle).

Now you have the minimal set of compares for your scalar UDF to be called over. Put this result into a table indexed on the two columns.

You can do the same for all your UDF parameter sets. If the function definition isn't changing, you only need to add new combinations to your table over time, turning an expensive scalar function call into a table lookup on relatively slower-growing data. You can even use a CASE statement to fall back on the inline UDF call if the lookup table doesn't have an entry, so you can decide whether or not to keep the lookup tables comprehensive.
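A sketch of that lookup-with-fallback idea, where PostcodeMatches is a hypothetical precomputed table of (Postcode1, Postcode2, MatchRank) indexed on the pair:

```sql
SELECT t1.Postcode, t2.Postcode,
       CASE WHEN pm.MatchRank IS NOT NULL
            THEN pm.MatchRank                                    -- cheap: precomputed lookup
            ELSE dbo.FuzzyLogicStringMatch(t1.Postcode, t2.Postcode)  -- fallback: live UDF call
       END AS PostcodeMatchRank
  FROM Table1 t1
  CROSS JOIN Table2 t2
  LEFT JOIN PostcodeMatches pm
    ON pm.Postcode1 = t1.Postcode
   AND pm.Postcode2 = t2.Postcode
```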

Cade Roux
A: 

One thing we did was set up a separate table to store the results of the cross join, so they only have to be calculated once. Then have a job that runs nightly to pick up any new records and populate them against all the old records. After all, the lat/longs are not going to change (except for the occasional typo, which the nightly job can find and fix), and it doesn't make sense to redo the calculation every time you run a query to find the distances when the result will always be the same for most of the numbers. Once this data is in a table, you can query it very quickly. Populating the table the first time might take a while.
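The nightly incremental step might look like the sketch below; the MatchResults table and the AddedDate column are assumptions about how you track new rows:

```sql
-- Nightly job: rank only the pairs involving rows added since the last run.
INSERT INTO MatchResults (LImportId, LocationId, MatchRanking)
SELECT li.LImportId, l.LocationId,
       dbo.MatchRanking(li.Latitude, li.Longitude, li.Postcode, li.[Name], li.TelephoneNumber,
                        l.Latitude, l.Longitude, l.Postcode, l.CompanyName, l.TelephoneNumber)
  FROM @LocImport li
  CROSS JOIN [Location] l
 WHERE li.AddedDate >= DATEADD(DAY, -1, GETDATE());  -- assumed "new rows" filter
```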

HLGEM