tags:
views: 156
answers: 2

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.

To put the question in context, here's an overview of what I do currently:

  1. Drop all foreign key and unique constraints
  2. Import the data (~100 million records)
  3. Re-create the constraints and run ANALYZE on the fact table (sketched just below).
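
For reference, a minimal sketch of that drop/re-create cycle for a single foreign key on the fact table (the constraint name here is illustrative, not taken from my actual schema):

-- before the load: drop the constraint so each insert isn't validated row by row
ALTER TABLE event DROP CONSTRAINT event_fk_host_fkey;

-- ... bulk load the ~100 million records ...

-- after the load: re-create the constraint (validated once, in bulk) and refresh statistics
ALTER TABLE event ADD CONSTRAINT event_fk_host_fkey
    FOREIGN KEY (fk_host) REFERENCES host (id);
ANALYZE event;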

Importing the data is done by loading from files. For each file:

1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk-load command)
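
A minimal sketch of that COPY step (the file path and format options are assumptions, not my actual setup):

-- path and format are hypothetical; adjust to match the real log layout
COPY temp_table FROM '/data/logs/batch_0001.log' WITH (FORMAT csv);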

2) Update each of the 9 dimension tables with any new data using an insert for each such as:

INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;

The ANALYZE is run after each INSERT with the idea of keeping the statistics up to date over the course of tens of millions of inserts. (Is this advisable or necessary? At minimum it does not seem to hurt performance significantly.)

3) The fact table is then updated with an unholy 9-way join:

INSERT INTO event (time, status, fk_host, fk_etype, ... ) 
SELECT t.time, t.status, host.id, etype.id ... 
FROM temp_table as t 
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table

Are there better approaches I'm overlooking?

A: 

During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.

Instead I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key the fact table should reference. This significantly reduces the work you need to do in stage 3, which becomes a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ...).
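
A minimal sketch of what one of these getOrInsert sprocs might look like in PL/pgSQL, using the host dimension from the question (the function name and signature are illustrative):

CREATE OR REPLACE FUNCTION getOrInsertHostDim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- look up the existing dimension row
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        -- not found: insert it and capture the new surrogate key
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;

insertXXXFact would then call one of these per dimension and insert the returned keys straight into the fact table.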

The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.

Adamski
I agree in principle, but when I tried that approach, I found it to be 50% slower on average. It looks like the caching of the dimension tables combined with doing everything as bulk operations (rather than individual selects/inserts) is faster.
Rob
@Rob: That's interesting as it's an approach that's worked for me in the past. BTW I cannot believe this answer was downvoted without any comment whatsoever!
Adamski
+1  A: 

I've tried several different approaches to normalizing incoming data like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and the "SELECT DISTINCT ... EXCEPT SELECT" pattern works well for me. Other folks may know better, but I've found your current method to be my preferred one.

rfusca