I want to perform an SQL query that is logically equivalent to the following:

DELETE FROM pond_pairs
WHERE
  ((pond1 = 12) AND (pond2 = 233)) OR
  ((pond1 = 12) AND (pond2 = 234)) OR
  ((pond1 = 12) AND (pond2 = 8)) OR
  ((pond1 = 13) AND (pond2 = 6547)) OR
  ((pond1 = 13879) AND (pond2 = 6))

I will have hundreds of thousands of pond1-pond2 pairs. I have an index on (pond1, pond2).

My limited SQL knowledge came up with several approaches:

  1. Run the whole query as is.
  2. Batch the query up into smaller queries with n WHERE conditions each.
  3. Save the pond1-pond2 pairs into a new table, and use a subquery in the WHERE clause to identify the rows to delete.
  4. Convert the Python logic which identifies rows to delete into a stored procedure. Note that I am unfamiliar with programming stored procedures, so this would probably involve a steep learning curve.

I am using postgres if that is relevant.

+1  A: 

I would go with 3 (with a JOIN rather than a subquery) and measure the time of the DELETE query alone (excluding creating the table and inserting the pairs). This is a good starting point: joining is a very common and well-optimized operation, so it will be hard to beat that time. You can then compare that time with your current approach.

You can also try the following approach:

  1. Sort the pairs in the same order as the index.
  2. Delete using method 2 from your description (probably in a single transaction).

Sorting before deleting should give better index read performance, because consecutive lookups are more likely to hit the disk cache.
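The two steps above could be sketched in Python roughly as follows. This is only an illustration: the helper name batched_deletes and the batch size are hypothetical, and the %s placeholders assume a psycopg2-style driver.

```python
from itertools import islice


def batched_deletes(pairs, batch_size=1000):
    """Yield (sql, params) tuples, one per batch of pairs.

    Hypothetical helper: pairs are de-duplicated and sorted in
    (pond1, pond2) order, i.e. the same order as the index.
    """
    ordered = iter(sorted(set(pairs)))
    while True:
        batch = list(islice(ordered, batch_size))
        if not batch:
            break
        # One "(%s, %s)" group per pair; %s is the psycopg2 placeholder style.
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        sql = f"DELETE FROM pond_pairs WHERE (pond1, pond2) IN ({placeholders})"
        # Flatten [(a, b), (c, d)] into [a, b, c, d] to match the placeholders.
        params = [value for pair in batch for value in pair]
        yield sql, params
```

Each (sql, params) tuple would then be passed to cursor.execute(sql, params), with all batches run inside one transaction and a single commit at the end.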

Tomasz Wysocki
DELETE works against JOINed tables?
Thilo
Yes, there is an example in Frank Heikens' answer.
Tomasz Wysocki
That USING clause is neat. But he still needs to send the pairs into the database (unless they are already there somewhere).
Thilo
I'm not suggesting that this is the final solution. Deleting via a temporary table is a great point of reference, because it will be very hard to delete records faster. So if one of the other proposals has similar speed, it will be a good choice.
Tomasz Wysocki
+1  A: 

For a large number of pond1-pond2 pairs to be deleted in a single DELETE, I would create a temporary table and join on it.

-- Create the temp table:
CREATE TEMP TABLE foo AS SELECT * FROM (VALUES(1,2), (1,3)) AS sub (pond1, pond2);

-- Delete
DELETE FROM bar 
USING  
  foo -- the joined table
WHERE 
  bar.pond1= foo.pond1 
AND 
  bar.pond2 = foo.pond2;
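
When the pairs live in a Python program, the temp table could be filled with COPY rather than a VALUES list. The sketch below is an assumption-laden illustration, not a tested recipe: it assumes a psycopg2 connection (whose cursors provide copy_expert), integer columns, and the table names foo and bar from the example above; both function names are hypothetical.

```python
import io


def pairs_to_copy_buffer(pairs):
    """Serialize pairs into PostgreSQL's default COPY text format
    (one row per line, columns separated by tabs)."""
    buf = io.StringIO()
    for pond1, pond2 in pairs:
        # int() guards against anything non-numeric ending up in the stream.
        buf.write(f"{int(pond1)}\t{int(pond2)}\n")
    buf.seek(0)
    return buf


def delete_pairs_via_temp_table(conn, pairs):
    """Assumes conn is a psycopg2 connection; column types are a guess."""
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TEMP TABLE foo (pond1 integer, pond2 integer) ON COMMIT DROP"
        )
        # Stream all pairs to the server in a single COPY operation.
        cur.copy_expert(
            "COPY foo (pond1, pond2) FROM STDIN", pairs_to_copy_buffer(pairs)
        )
        cur.execute(
            "DELETE FROM bar USING foo "
            "WHERE bar.pond1 = foo.pond1 AND bar.pond2 = foo.pond2"
        )
        deleted = cur.rowcount
    conn.commit()
    return deleted
```

With hundreds of thousands of pairs it may also be worth running ANALYZE on the temp table before the DELETE so the planner has row estimates for the join.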
Frank Heikens
Filling the TEMP TABLE with the pairs is an equivalent problem to the original DELETE question, though (unless the pairs are already in the database somewhere).
Thilo
No, it's not: you can use COPY to fill the temp table. This is MUCH faster than any other way of getting the data into your temp table. I gave only a very simple example, but the idea is the same.
Frank Heikens
Can you show how to use COPY to fill the temp table?
Thilo
@Thilo: Just check http://python.projects.postgresql.org/docs/1.0/copyman.html
Frank Heikens
I see. 'receive_stmt = destination.prepare("COPY loading_table FROM STDIN")' would be a good way to put these numbers into the table.
Thilo
A: 

With hundreds of thousands of pairs, you cannot do 1 (run the query as is), because the SQL statement would be too long.

3 is good if you already have the pairs in a table. If not, you would need to insert them first. If you do not need them later, you might just as well run the same number of DELETE statements instead of INSERT statements.

How about a prepared statement in a loop, maybe batched (if Python supports that)?

  1. begin transaction
  2. prepare statement "DELETE FROM pond_pairs WHERE ((pond1 = ?) AND (pond2 = ?))"
  3. loop over your data (in Python), and run the statement with one pair (or add to batch)
  4. commit
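
The four steps above might look like this in Python. To keep the sketch runnable without a server, it uses the standard-library sqlite3 module as a stand-in for PostgreSQL; that substitution is an assumption made for illustration only. With a PostgreSQL driver such as psycopg2, the placeholder would be %s instead of ?, and executemany is not guaranteed to be faster than a plain loop.

```python
import sqlite3


def delete_pairs(conn, pairs):
    """Run one parameterized DELETE per pair inside a single transaction."""
    sql = "DELETE FROM pond_pairs WHERE pond1 = ? AND pond2 = ?"
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(sql, pairs)


# Demo on an in-memory SQLite database (stand-in for the real table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pond_pairs (pond1 INTEGER, pond2 INTEGER)")
conn.executemany(
    "INSERT INTO pond_pairs VALUES (?, ?)",
    [(12, 233), (12, 234), (13, 6547), (99, 1)],
)
conn.commit()

# (5, 5) is not in the table; deleting it is simply a no-op.
delete_pairs(conn, [(12, 233), (13, 6547), (5, 5)])
remaining = conn.execute(
    "SELECT pond1, pond2 FROM pond_pairs ORDER BY pond1, pond2"
).fetchall()
```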

Where are the pairs coming from? If you can write a SELECT statement to identify them, you can just move this condition into the WHERE clause of your DELETE.

DELETE FROM pond_pairs WHERE (pond1, pond2) IN (SELECT pond1, pond2 FROM ......  )
Thilo