We're starting to load up a data warehouse with data from event logs. We have a normal star schema where a row in the fact table represents one event. Our dimension tables are a typical combination of user_agent, ip, referal, page, etc. One dimension table looks like this:

create table referal_dim(
  id integer,
  domain varchar(255),
  subdomain varchar(255),
  page_name varchar(4096),
  query_string varchar(4096),
  path varchar(4096)
)

Where we autogenerate the id to eventually join against the fact table. My question is: what's the best way to identify duplicate records in our bulk load process? We upload all the records for a log file into temp tables before doing the actual insert into the persistent store; however, the id is just auto-incremented, so two identical dim records from two days would have different ids. Would creating a hash of the value columns and comparing on that be appropriate? It seems like comparing on each value column individually would be slow. Are there any best practices for a situation like this?
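For illustration, a minimal sketch of the hash-comparison idea, assuming a dialect with an md5() function (e.g. PostgreSQL) and a staging table named referal_stage with the same value columns; the staging table name is an assumption:

-- Sketch: find staged rows that already exist in the dimension by
-- hashing the concatenated value columns on both sides.
-- coalesce() keeps a NULL column from nulling out the whole hash input.
select s.*
from referal_stage s
where md5(coalesce(s.domain, '')       || '|' ||
          coalesce(s.subdomain, '')    || '|' ||
          coalesce(s.page_name, '')    || '|' ||
          coalesce(s.query_string, '') || '|' ||
          coalesce(s.path, ''))
   in (select md5(coalesce(d.domain, '')       || '|' ||
                  coalesce(d.subdomain, '')    || '|' ||
                  coalesce(d.page_name, '')    || '|' ||
                  coalesce(d.query_string, '') || '|' ||
                  coalesce(d.path, ''))
       from referal_dim d);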

+1  A: 

An auto-increment integer for a surrogate PK is OK, but (according to Mr. Kimball) a dimension table should also have a natural key. So a hashed NaturalKey column would be in order; a Status column for "current" or "expired" may also be useful to allow for SCD type 2.

Damir Sudarevic
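As a sketch of what that might look like on the DDL side, assuming an md5 hex string for the hash (the natural_key and status column names are illustrative, not from the answer):

-- Sketch: referal_dim extended with a hashed natural key and an
-- SCD type 2 status flag.
create table referal_dim(
  id integer,             -- surrogate key, auto-generated
  natural_key char(32),   -- e.g. md5 over all the value columns
  status varchar(10),     -- 'current' or 'expired' for SCD type 2
  domain varchar(255),
  subdomain varchar(255),
  page_name varchar(4096),
  query_string varchar(4096),
  path varchar(4096)
)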
Cool, a hashed natural key is what we were thinking of to go along with our surrogate id, but we weren't sure what the performance tradeoff would be.
igkins
To improve loading performance, you can maintain (re-create after each load) a key matching table for the dimension, like (NaturalKey, PrKey), which maps each natural key to the latest primary key for the table. During loading, add a hash column to your records, then look up the primary key in the key matching table. If it's not found, the record is new, so stage it for insert. If it is found, the record already exists, so decide what to do (discard, SCD type 1, or SCD type 2). Then load the dimension table.
Damir Sudarevic
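A minimal sketch of that key-matching flow, assuming the staging table carries a precomputed natural_key hash and that surrogate ids come from a sequence named referal_dim_seq (all object names here are illustrative):

-- 1. Re-create the key matching table after each load: the latest
--    surrogate key for each natural key.
create table key_matching as
  select natural_key, max(id) as pr_key
  from referal_dim
  where status = 'current'
  group by natural_key;

-- 2. Staged rows with no match are new; insert them as 'current'.
--    Matched rows already exist, so handle them separately
--    (discard, SCD type 1, or SCD type 2).
insert into referal_dim (id, natural_key, status, domain, subdomain,
                         page_name, query_string, path)
select nextval('referal_dim_seq'), s.natural_key, 'current',
       s.domain, s.subdomain, s.page_name, s.query_string, s.path
from referal_stage s
left join key_matching k on k.natural_key = s.natural_key
where k.pr_key is null;   -- keep only rows with no existing match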