Hi, I have a very large database with about 120 Million records in one table.I have clean up the data in this table first before I divide it into several tables(possibly normalizing it). The columns of this table is as follows: "id(Primary Key), userId, Url, Tag " . This is basically a subset of the dataset from delicious website. As I said, each row has an id, userID a url and only "one" tag. So for example a bookmark in delicious website is composed of several tags for a single url, this corresponds to several lines of my database. for example: "id"; "user" ;"url" ;"tag" "38";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"ajax" "39";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"api" "40";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"javascript" "41";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"library" "42";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"rails"
If I want to see the number of tags for each "distinct" url I run the below query.
SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url
Now I want to delete the records that have less than 5 tags associated with their urls. Does anyone knows the actual query that I have to run? thanks