I have a primary table for Articles that is linked by a join table Info to a table Tags that has only a small number of entries. I want to split the Articles table, by either deleting rows or creating a new table with only the entries I want, based on the absence of a link to a certain tag. There are a few million articles. How can I do this?

Not all of the articles have any tag at all, and some have many tags.

Example:

table Articles
  primary_key id
table Info
  foreign_key article_id
  foreign_key tag_id
table Tags
  primary_key id

It was easy for me to segregate the articles that do have the match right off the bat, so I thought I could do that first and then use a NOT IN subquery for the rest, but that runs so slowly that it's unclear whether it will ever finish. I did that with these commands:

INSERT INTO matched_articles SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5;
INSERT INTO unmatched_articles SELECT * FROM articles a WHERE a.id NOT IN (SELECT m.id FROM matched_articles m);

If it makes a difference, I'm on Postgres.

+1  A: 

Your queries look ok except the first one should be an inner join, not a left join. If you want to try something else, consider this:

INSERT INTO matched_articles 
SELECT * 
FROM articles a 
INNER JOIN info i ON a.id = i.article_id 
WHERE i.tag_id = 5;

INSERT INTO unmatched_articles 
SELECT a.* 
FROM articles a 
LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5
WHERE i.article_id IS NULL;

That might be faster but really, what you have is probably ok if you only have to do it once.

Michael Haren
+1  A: 
INSERT INTO matched_articles 
SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5; 

INSERT INTO unmatched_articles 
SELECT * FROM articles a WHERE a.id NOT IN (SELECT m.id FROM matched_articles m); 

There's so much wrong here, I'm not sure where to start. OK, in your first insert you do not need a left join; in fact, you don't actually have one. It should be:

INSERT INTO matched_articles 
SELECT * FROM articles a INNER JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5; 

Had you needed a left join you would have had

INSERT INTO matched_articles 
SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5; 

When you put something from the right side of a left join into the where clause (other than searching for the null values), you convert it to an inner join because every row must meet that condition; therefore the records that don't have a match in the right table are eliminated.
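To see the difference for yourself, you can compare row counts. This is only a sketch using the table and column names from the question; I have not run it against your data.

```sql
-- Filter in WHERE: articles without a tag-5 row are dropped,
-- so this behaves exactly like an INNER JOIN.
SELECT count(*)
FROM articles a
LEFT JOIN info i ON a.id = i.article_id
WHERE i.tag_id = 5;

-- Filter in ON: every article is kept; the info columns are simply
-- NULL for articles that have no tag-5 row.
SELECT count(*)
FROM articles a
LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5;
```

The second count should be at least as large as the number of rows in articles, while the first counts only the tag-5 links.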

Now the second statement can be done with a special case of the left join, although what you have will work.

INSERT INTO unmatched_articles 
SELECT a.* FROM articles a 
LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5
WHERE i.tag_id IS NULL;

This will give you all the records in the articles table that have no matching row in the info table for that tag, including articles that have no tags at all.
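The same thing can be written with NOT EXISTS, which some people find easier to read than the left join with the null check. A minimal sketch, assuming unmatched_articles has the same columns as articles (the question does not show the table definitions):

```sql
-- Copy every article that has no link to tag 5.
-- Assumption: unmatched_articles has the same column layout as articles.
INSERT INTO unmatched_articles
SELECT a.*
FROM articles a
WHERE NOT EXISTS (
    SELECT 1
    FROM info i
    WHERE i.article_id = a.id
      AND i.tag_id = 5
);
```

Unlike NOT IN, NOT EXISTS is not tripped up by NULLs in the subquery, and Postgres can usually plan it as an anti-join rather than checking the whole list for every row.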

Now the next thing: you should not write insert statements without specifying the fields you want to insert. Nor should you ever write a select statement using select *, especially if you have a join. This is generally sloppy, lazy coding and should be fixed. What if someone changed the structure of one of the tables but not the other? This kind of thing is bad for maintenance, and in the case of a select statement with a join, it is returning a column twice (the join column), which is a waste of server and network resources. It is just poor coding to be too lazy to specify what you need and only what you need. So get out of the habit and don't do it again for any production code.

If your current statement is too slow, you may also be able to fix it with the right indexes. Are the id fields indexed on both tables? On the other hand, if there are millions of articles, it is going to take time to insert them. It is often better to do this in batches, maybe 50000 at a time (fewer still if this takes too long). Just do the insert in a loop that selects the top XXX records (LIMIT in Postgres) and then loops until the row count affected is none.
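Here is a rough sketch of both ideas for Postgres. The index names are made up, and I am assuming unmatched_articles has the same columns as articles with a unique id. Postgres indexes a primary key automatically, but not foreign key columns, so info.article_id may well be unindexed.

```sql
-- Hypothetical index names; adjust to your own naming scheme.
CREATE INDEX info_article_tag_idx ON info (article_id, tag_id);
CREATE INDEX unmatched_articles_id_idx ON unmatched_articles (id);

-- One batch: copy up to 50000 untagged articles that have not been
-- copied yet. Re-run this statement until it reports "INSERT 0 0".
INSERT INTO unmatched_articles
SELECT a.*
FROM articles a
WHERE NOT EXISTS (
        SELECT 1 FROM info i
        WHERE i.article_id = a.id AND i.tag_id = 5)
  AND NOT EXISTS (
        SELECT 1 FROM unmatched_articles u
        WHERE u.id = a.id)
ORDER BY a.id
LIMIT 50000;
```

Each run is its own transaction when executed in autocommit mode, so an interruption costs at most one batch of work.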

HLGEM
Thanks a lot for your help. I'm really just getting started off with working with databases, and your feedback was super helpful. Your insert statement for matches worked great, got done in seconds what had already been running for hours with my previous method.
WIlliam Jones
+1  A: 

Postgres supports temporary tables (CREATE TEMP TABLE), so here is another way this can be done.

CREATE TEMP TABLE tag_counts AS
SELECT a.id, COUNT(i.article_id) AS total
FROM articles a
LEFT JOIN info i
ON a.id = i.article_id AND i.tag_id = 5
GROUP BY a.id;

-- total = 0 means the article has no link to tag 5
INSERT INTO unmatched_articles
SELECT a.*
FROM articles a INNER JOIN tag_counts t
ON a.id = t.id AND t.total = 0;

DELETE FROM tag_counts
WHERE total = 0;

-- whatever is left in tag_counts has at least one link to tag 5
INSERT INTO matched_articles
SELECT a.*
FROM articles a INNER JOIN tag_counts t
ON a.id = t.id;

Note that I have not run this in an editor to try it out.
I hope this gives you a hint on how I would approach this.

shahkalpesh