If I needed to do a left join between 2 tables (in order to run some kind of analysis between them), but both datasets are too large for this to be executed in a single query, what's the best practice to accomplish this?

I saw FETCH in the documentation, but I wasn't sure whether it's conventionally used to loop over entire datasets. Since I figured this task had to be commonplace, I didn't want to hodgepodge FETCH or OFFSET together improperly just to get my analysis done.

Note: This is a local database that will not be altered for the duration of the procedure, so performance considerations and transactions aren't a factor.

I'm using PostgreSQL, but I'm sure the practice is similar across modern DBMSs.

+1  A: 

I agree with the comments that a modern DBMS should be able to join any tables it can store. Sometimes you have to tell the database not to try a hash join on gigantic tables; hash joins are very fast, but not when the hash table doesn't fit in memory. In PostgreSQL, you can disable hash joins with:

SET enable_hashjoin = off;
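
The setting is session-local, so you can verify the plan and then restore the default afterwards; a minimal sketch, using the same placeholder tables as below:

EXPLAIN
SELECT  count(*)
FROM    YourTable1 a
LEFT JOIN
        YourTable2 b
ON      a.CustomerName = b.CustomerName;  -- the plan should no longer show a Hash Join node

RESET enable_hashjoin;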

Having said that, some databases do perform better if you split a query into smaller batches. You can use subqueries to partition a join into batches:

select  *
from    (
        select  *
        from    YourTable1
        where   CustomerName like 'A%'
        ) a
left join 
        (
        select  *
        from    YourTable2
        where   CustomerName like 'A%'
        ) b
on      a.CustomerName = b.CustomerName

This only helps if the database has an efficient way to apply the filter; in this example, that would be an index on CustomerName in both tables.
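
If you want to run the whole join batch by batch instead of writing one query per letter, you can drive the batches from a PL/pgSQL loop and collect the output in a results table. This is only a sketch under assumptions: it needs PostgreSQL 9.0+ for DO blocks, SomeColumn is a placeholder for whatever you actually need from YourTable2, and first-letter batches are just one possible partitioning:

-- results table; SomeColumn stands in for the columns your analysis needs
CREATE TABLE join_results (
    CustomerName text,
    SomeColumn   text
);

DO $$
DECLARE
    prefix text;
BEGIN
    -- one batch per starting letter, A through Z
    FOR prefix IN SELECT chr(c) FROM generate_series(ascii('A'), ascii('Z')) AS t(c)
    LOOP
        INSERT INTO join_results
        SELECT  a.CustomerName, b.SomeColumn
        FROM    YourTable1 a
        LEFT JOIN YourTable2 b
        ON      a.CustomerName = b.CustomerName
        WHERE   a.CustomerName LIKE prefix || '%';
    END LOOP;
END $$;

Keep in mind that LIKE can only use a plain b-tree index for prefix matches if the database collation is C or the index was built with text_pattern_ops, so check the plan for a single batch before looping over all of them.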

Andomar