views:

389

answers:

5

Hello

note: to the editors: please edit the title if have a better one :)

my question is:

I have two tables in my database

     -----------
     | table1   |
     |----------|
     | id       |
     |text      |
     ===========


     -----------
     | table2   |
     |----------|
     | id       |
     |text      |
     ===========

table1 is 600,000 records

table2 is 5,000,000 records !!:)

what is the best way to delete all the records in table2 that are not in table1

I main by the way -the fastest way because I don't want to wait 4 hours to complete the process

do you have something better than the following code:

<?PHP
   $sql = "select text from table2";
   $result = mysql_query($sql) or die(mysql_error());
   while($row = mysql_fetch_array($result)){
        $text = $row["text"];
        $sql2 = "select id from table1 where text = '$text'";
        $query2 = mysql_query($sql2) or die(mysql_error());
        $result2 = mysql_num_rows($query2);
        if($result2==0){
             $sql3 = "delete from table2 where text = '$text'";
             $query3 = mysql_query($sql3) or die(mysql_error());
        }
   }
?>

Thanks

+4  A: 

what about letting the RDBM handle it ?

for example

DELETE FROM table2 WHERE text NOT IN (select distinct text from table1)

Cheers

PS: do some backup before testing ...

RageZ
Could deleting such a big amount of records overflow the transaction log ?
mjv
depend if he does it a transaction but yeah if 'yes' the transaction record would greatly increase
RageZ
+1  A: 

Why not add a new column to table2 that is one byte and then just do an update setting that byte to true or 'Y' if that row is in both tables.

Then, just delete the rows that don't have this one column set.

That would seem to be the simplest and fastest, IMO.

James Black
+3  A: 

Your solution is doing something like 2 queries per line in the table2 table -- which means a couple of million queries -- which will be rather slow ^^

Using MySQL, you might be able to delete all this in only one query : the DELETE instruction can be used to delete data from multiple-tables.

First thing is to write the select instruction that will match the data you want to delete (it's a better way to test than trying a delete without knowing if it'll really deal with the right data) ; something like this might do :

select table2.*
from table2
    left join table1 on table1.text = table2.text
where table1.id is NULL

This should get you all data that is in table2, but is not in table1.

Once you are sure this query is getting the right data, you can transform it to a delete query :

delete table2
from table2
    left join table1 on table1.text = table2.text
where table1.id is NULL

This might do -- of course, it would be best to first test on a test database, and not on your production one !

Else, something with an IN and a subquery might do ; a bit like

delete
from table2
where text not in (select text from table1)

Not sure what will be faster, though, considering the amount of data you have -- still either way, I would not do the kind of PHP loop you proposed, but would go with a SQL query that can delete everything by itself : avoiding all those calls from PHP to the DB will most certainly make things faster !

Pascal MARTIN
+1  A: 

Try this:

DELETE table2 Where id NOT IN (SELECT id from table1)

Note: Plz keep backup before running the query

Himadri
A: 

Create table3 like table2 insert into table3 (SELECT table2.ID, TABle2.TEXT from table1 join table2 on ...) drop table2 alter table3 new name table2

Involves a bit of management (so it's only a valid option if you can easily drop/alter tables), but at least the DML part will outperform any other option, methinks.