views:

40

answers:

3

I have a table that allows records to be inserted which are very similar but differ by a single column value or two. For example the following records are in the table:

ID    TITLE                                 URL                             COUNTRY

1494  Hollywood Reporter                    http://www.hollywoodreporter.com    USA
1497  Hollywood Reporter via Drudge Report  http://www.hollywoodreporter.com    NULL
2158  Hollywood Reporter via                http://www.hollywoodreporter.com    NULL

I would like to update the country column in the last two records where the url is the same. Also I would also like to know how to sort my table by url column so that all duplicate urls are grouped together or even if they are similar as you have in some cases like:

http://www.hollywoodreporter.com       http://www.blog.hollywoodreporter.com

Thanks in advance.

Tried the following and it worked where the urls were the same

UPDATE t1

SET t1.country = t2.country FROM Sources AS t1 JOIN sources AS t2 ON t1.url = t2.url WHERE t1.url= t2.url;

just got to figure out the rest. thanks everyone


UPDATE

I was able to edit the records that had matching urls but for the ones that are similar like http://www.pantagraph.com http://pantagraph.com http://pantagraph.com/titles

were not updated. When using the sort and select I can view all these records but when trying to update them it does not work. I even just tried this simple version of:

 select * from Sources s
 where s.url like url 

which shows the records but when updating it did not work.

update Sources 
set country = s.country 
from Sources s 
 where s.url like url
+2  A: 

You could use a subquery:

update  yt
set     country = (
                  select  distinct country 
                  from    YourTable yt2 
                  where   yt.url = yt2.url 
                          and yt2.country is not null
                  )
from    YourTable yt

This should give an error if there is an URL with different countries: in that case you should adapt the query to choose one of the countries.

Per your comment, to find rows with conflicting countries:

select  url
,       count(distinct country) as NumberOfCountries
from    YourTable
where   country is not null
group by
        url
,       country
having  count(distinct country) > 1
Andomar
yeah i got the following error;Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=,When I do the following:select * from Sources swhere url = s.url I see the records I want to update but just cannot figure out how.
vbNewbie
@vbNewbie: Added a conflict seeking query to the answer, hope it helps
Andomar
+1  A: 

For sorting the URL, try adding another column for just "domain name". You would extract the domain name (hollywoodreporter.com), store it, and sort on that column. Alternatively, you could use regular expressions in your ORDER BY clause, but that could lead to performance problems.

Marcus Adams
+1  A: 
update table
set
    table.country = t2.country
from
    table t2
where
    table.url = t2.url
    and t2.country is not null

You're joining the table to itself based on url and updating from rows that have something in the country field.

But, if you've got that much duplicated data, your design is probably bad. Try normalizing the db if you can.

Sean
If there were multiple countries per URL, this would overwrite country multiple times (not necessarily bad)
Andomar
Thanks for the response. I was told that above as well but with this table in particular having more than 2million records I am afraid I may comprimise the data. I have removed duplicates and some columns contain the same value but the records themselves are unique.
vbNewbie