tags:
views: 156
answers: 2

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.

To put the question in context, here's an overview of what I do currently:

  1. Drop all foreign key and unique constraints
  2. Import the data (~100 million records)
  3. Re-create the constraints and run ANALYZE on the fact table (sketched just below).
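
For reference, a minimal sketch of that drop/re-create cycle for a single foreign key on the fact table (the constraint name here is illustrative, not taken from my actual schema):

-- before the load: drop the constraint so each insert isn't validated row by row
ALTER TABLE event DROP CONSTRAINT event_fk_host_fkey;

-- ... bulk load the ~100 million records ...

-- after the load: re-create the constraint (validated once, in bulk) and refresh statistics
ALTER TABLE event ADD CONSTRAINT event_fk_host_fkey
    FOREIGN KEY (fk_host) REFERENCES host (id);
ANALYZE event;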

Importing the data is done by loading from files. For each file:

1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk-load command)
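
A minimal sketch of that COPY step (the file path and format options are assumptions, not my actual setup):

-- path and format are hypothetical; adjust to match the real log layout
COPY temp_table FROM '/data/logs/batch_0001.log' WITH (FORMAT csv);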

2) Update each of the 9 dimension tables with any new data using an insert for each such as:

INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;

The ANALYZE is run after each INSERT with the idea of keeping the statistics up to date over the course of tens of millions of inserts. (Is this advisable or necessary? At minimum it does not seem to hurt performance significantly.)

3) The fact table is then updated with an unholy 9-way join:

INSERT INTO event (time, status, fk_host, fk_etype, ... ) 
SELECT t.time, t.status, host.id, etype.id ... 
FROM temp_table as t 
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table

Are there better approaches I'm overlooking?

A: 

During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.

Instead I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key the fact table should reference. This significantly reduces the work you need to do in stage 3, which becomes a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ...).
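
A minimal sketch of what one of these getOrInsert sprocs might look like in PL/pgSQL, using the host dimension from the question (the function name and signature are illustrative):

CREATE OR REPLACE FUNCTION getOrInsertHostDim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- look up the existing dimension row
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        -- not found: insert it and capture the new surrogate key
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;

insertXXXFact would then call one of these per dimension and insert the returned keys straight into the fact table.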

The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.

Adamski
I agree in principle, but when I tried that approach, I found it to be 50% slower on average. It looks like the caching of the dimension tables combined with doing everything as bulk operations (rather than individual selects/inserts) is faster.
Rob
@Rob: That's interesting as it's an approach that's worked for me in the past. BTW I cannot believe this answer was downvoted without any comment whatsoever!
Adamski
+1  A: 

I've tried several different approaches to normalizing incoming data like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and the "SELECT DISTINCT ... EXCEPT SELECT" pattern works well for me. Other folks may know better, but I've found your current method to be my preferred one.

rfusca