Hi, I have a very large database with about 120 million records in one table. I have to clean up the data in this table before I divide it into several tables (possibly normalizing it). The columns of this table are as follows: "id (primary key), userId, url, tag". This is basically a subset of the dataset from the Delicious website. As I said, each row has an id, a userId, a url and only one tag. A bookmark on Delicious has several tags for a single url, so it corresponds to several rows in my table. For example:

"id";"user";"url";"tag"
"38";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"ajax"
"39";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"api"
"40";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"javascript"
"41";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"library"
"42";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"rails"

If I want to see the number of tags for each distinct url, I run the query below.

SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url

Now I want to delete the records whose urls have fewer than 5 tags associated with them. Does anyone know the actual query that I have to run? Thanks.

A: 
delete from urltag where url in (SELECT DISTINCT url FROM urltag GROUP BY url HAVING count(tag) < 5)

should do it. But note that your request doesn't take into account that several different userIds could have submitted the same url...

oedo
I get an error when I run this query: "You can't specify target table 'urltag' for update in FROM clause".
Hossein
Ah, in that case you might not be able to do it in a single query in MySQL. I think MS SQL can deal with this case, though. I guess your only option is to do it in two queries then: get the distinct urls that have count(tag) < 5, and then run a delete query on those urls.
oedo
Thank you for your info.
Hossein
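
The two-query approach suggested in the comment above could be sketched roughly as follows. This is only an illustration: it uses Python with an in-memory SQLite database as a stand-in for MySQL, the second url (http://example.com/) and its tags are made up for the demo, and on 120 million rows you would probably want to batch the deletes rather than run them exactly like this.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for the demo).
# Table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urltag (id INTEGER PRIMARY KEY, userId TEXT, url TEXT, tag TEXT)"
)

user = "12c2763095ec44e498f870ed67ee948d"
rows = [(user, "http://forkjavascript.org/", t)
        for t in ("ajax", "api", "javascript", "library", "rails")]
# Hypothetical second url with only 2 tags, so there is something to delete.
rows += [(user, "http://example.com/", t) for t in ("news", "blog")]
conn.executemany("INSERT INTO urltag (userId, url, tag) VALUES (?, ?, ?)", rows)

# Query 1: collect the urls that have fewer than 5 tags.
sparse_urls = [r[0] for r in conn.execute(
    "SELECT url FROM urltag GROUP BY url HAVING COUNT(tag) < 5"
)]

# Query 2: delete the rows of those urls, one parameterized DELETE per url.
conn.executemany("DELETE FROM urltag WHERE url = ?", [(u,) for u in sparse_urls])
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM urltag").fetchone()[0]
print(sparse_urls, remaining)
```

Because the SELECT has finished before the DELETE starts, the table is never read and modified in the same statement, which is the situation the MySQL error message complains about.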
A: 

You don't need SELECT DISTINCT url, ... when you do GROUP BY url. I'd rewrite your query from

SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url

to

SELECT url, COUNT(tag) as "TagCount" FROM urltag GROUP BY url

Placing the tag column in the SELECT clause will not provide useful data. If a column is neither aggregated nor mentioned in the GROUP BY clause, the value returned for it is indeterminate: MySQL may return the tag from any row in the group.
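
To see what the rewritten query returns, here is a small sketch in Python using an in-memory SQLite database in place of MySQL (an assumption made so the example is runnable anywhere). The second url and its tags are invented for illustration; the first url and its five tags are the sample rows from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema follows the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urltag (id INTEGER PRIMARY KEY, userId TEXT, url TEXT, tag TEXT)"
)

user = "12c2763095ec44e498f870ed67ee948d"
rows = [(user, "http://forkjavascript.org/", t)
        for t in ("ajax", "api", "javascript", "library", "rails")]
# Hypothetical extra url with 2 tags, added only for the demo.
rows += [(user, "http://example.com/", t) for t in ("news", "blog")]
conn.executemany("INSERT INTO urltag (userId, url, tag) VALUES (?, ?, ?)", rows)

# GROUP BY url already yields exactly one row per url, so DISTINCT is redundant.
counts = dict(conn.execute(
    "SELECT url, COUNT(tag) AS TagCount FROM urltag GROUP BY url"
))

for url, tag_count in sorted(counts.items()):
    print(url, tag_count)
```

Each url appears once with its tag count, which is all that is needed to decide which urls fall below the threshold of 5.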

So if you want to remove all rows containing urls that have fewer than 5 tags associated with them, you can first add a flag to your table, such as:

alter table urltag 
    add column todelete tinyint(4) not null default 0,
    add key(todelete);

Then you can do

update urltag u 
inner join (
    SELECT url, count(tag) tagcount 
    FROM urltag GROUP BY url
    ) big on big.url = u.url
set u.todelete = 1
where big.tagcount < 5;

Then, just

delete from urltag where todelete = 1;
ceteras
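
A rough, runnable sketch of the flag-then-delete idea above, written in Python against an in-memory SQLite database rather than MySQL (an assumption for the sake of a self-contained demo). Since the MySQL "update ... inner join" syntax is not portable, the flagging step here uses a plain subquery instead; the index name idx_todelete and the second url are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema follows the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urltag (id INTEGER PRIMARY KEY, userId TEXT, url TEXT, tag TEXT)"
)

user = "12c2763095ec44e498f870ed67ee948d"
rows = [(user, "http://forkjavascript.org/", t)
        for t in ("ajax", "api", "javascript", "library", "rails")]
# Hypothetical url with only 2 tags, added so that something gets flagged.
rows += [(user, "http://example.com/", t) for t in ("news", "blog")]
conn.executemany("INSERT INTO urltag (userId, url, tag) VALUES (?, ?, ?)", rows)

# Step 1: add the flag column and an index on it (index name is hypothetical).
conn.execute("ALTER TABLE urltag ADD COLUMN todelete INTEGER NOT NULL DEFAULT 0")
conn.execute("CREATE INDEX idx_todelete ON urltag (todelete)")

# Step 2: flag every row whose url has fewer than 5 tags.
# SQLite accepts this subquery form; the answer uses a join for MySQL instead.
conn.execute(
    "UPDATE urltag SET todelete = 1 WHERE url IN "
    "(SELECT url FROM urltag GROUP BY url HAVING COUNT(tag) < 5)"
)
flagged = conn.execute(
    "SELECT COUNT(*) FROM urltag WHERE todelete = 1"
).fetchone()[0]

# Step 3: delete the flagged rows.
conn.execute("DELETE FROM urltag WHERE todelete = 1")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM urltag").fetchone()[0]
print(flagged, remaining)
```

One advantage of flagging first is that the flagged rows can be counted or inspected before anything is actually removed.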