views:

37

answers:

2

Is there a way to search for fields that contain similar values in a sql db? For example I have a table of over a million records where one column contains url values and is associated with a country column. Previously I tried to match urls that are equal where it contained a null value for the country as was able to update it using the following:

UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url;

Then I altered the query to use the like word as follows:

UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url
WHERE t1.url like t2.url;

when I just use the select statement to find the records where urls are like then I get the results but the update does not work. A better example is as follows:

These are all the same domain url and I just want to update the country column for each one to avoid doing it manually because there are around 200000 to do.

+1  A: 

How about:

 UPDATE t1
 SET t1.country = t2.country
 FROM Sources AS t1
 JOIN sources AS t2
 ON t1.url LIKE t2.url

See what kind of joins you get when you run that on your dataset...it may make too many bad matches.

At some point you'll probably need to do some matching based on exact portions of the url, but i don't know how to do that in a query like this. See this links for info:

http://www.w3schools.com/SQL/sql_wildcards.asp

Oh and if all the URLs contain the http://www. portion you could always do something like

 WHERE left(t1.url,16) = left(t1.url,16)

That might cut down on your execution time and enfore better joins

Albert
I'll try it again. As I mentioned in the above comment it was taking a long time; got a bit impatient. Will post back. Thanks for the response and the link.
vbNewbie
Where are you checking to see if a country record is null? If you're trying to pull country data from t1 to t2, then why don't you put a WHERE t2.country IS NULL in there to cut down on processing time. You only wanna fix the NULL records right?
Albert
right..let me add that.
vbNewbie
@Albert - It should be noted that `WHERE left(t1.url,16) = left(t1.url,16)` will cause a table scan and will likely increase the processing time but provide better "likeness" testing.
Thomas
update - the update is still running, is this unusual...since the update using t1.url = t2.url did not take this long and processor is at 100%
vbNewbie
+1  A: 

First, I don't see how your two queries could return different results. In both cases, you are requiring that the Url match exactly in the ON clause.

Second, what do you suppose will happen if there are two rows in the Sources table with identical Urls but different countries? This is one of the reasons to be careful with using the FROM clause in an UPDATE statement (and one of the reasons it is not officially supported by the ANSI standard). In the case of two rows with identical Urls but different countries, you need to decide which one will win. Let's suppose that the lowest sorted one will win:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url = Sources.Url
                    )

If you want "similar" urls, we first need to know how you are defining "similar". The use of LIKE with no wildcard is effectively the same as using =. If you use a leading wildcard (e.g. LIKE ('%' + Url)), you end up doing a table scan on your million row table. However, if we use only a trailing wildcard (e.g. LIKE (Url + '%')), then SQL can still use an index:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url Like ( Sources.Url + '%' )
                    )

EDIT

In a later comment, it was hinted that you only wanted to update rows with NULL Country values. If that is true, it requires a trivial addition of a WHERE clause:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url = Sources.Url
                    )
Where Country Is Null
Thomas