We're starting to load a data warehouse with data from event logs. We have a normal star schema where a row in the fact table represents one event. Our dimension tables are a typical combination of user_agent, ip, referal, page, etc. One dimension table looks like this:
create table referal_dim(
    id integer,           -- surrogate key, autogenerated during load
    domain varchar(255),
    subdomain varchar(255),
    page_name varchar(4096),
    query_string varchar(4096),
    path varchar(4096)
)
We autogenerate the id to eventually join against the fact table.

My question: what's the best way to identify duplicate records in our bulk load process? We load all the records for a log file into temp tables before doing the actual insert into the persistent store; however, the id is just auto-incremented, so two identical dim records loaded on two different days would end up with different ids. Comparing on every value column individually seems like it would be slow. Are there any best practices for a situation like this? Would it be appropriate to create a hash of the value columns and compare on that instead?
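Concretely, something like this rough sketch is what I have in mind (md5() and || concatenation assume Postgres-style syntax, referal_temp is just a stand-in name for our temp table, and the alter/index statements are one possible way to persist the hash, not something we already have):

-- Rough sketch only. Assumes Postgres-style md5() and || concatenation;
-- referal_temp is a stand-in for the per-log-file temp table.

-- Store a hash of the value columns on the dim and index it for the lookup.
alter table referal_dim add column row_hash char(32);
create index referal_dim_row_hash_idx on referal_dim (row_hash);

-- Insert only staging rows whose hash isn't already in the dim.
-- coalesce() and the '|' separator keep NULLs and adjacent columns from
-- producing identical concatenations; distinct handles dupes within one file.
-- id is omitted on the assumption it's auto-generated, as described above.
insert into referal_dim (domain, subdomain, page_name, query_string, path, row_hash)
select distinct s.domain, s.subdomain, s.page_name, s.query_string, s.path, s.row_hash
from (
    select domain, subdomain, page_name, query_string, path,
           md5(coalesce(domain, '')       || '|' ||
               coalesce(subdomain, '')    || '|' ||
               coalesce(page_name, '')    || '|' ||
               coalesce(query_string, '') || '|' ||
               coalesce(path, '')) as row_hash
    from referal_temp
) s
where not exists (
    select 1 from referal_dim d where d.row_hash = s.row_hash
);

Is a stored, indexed hash like this a reasonable way to do the comparison, or is there a more standard pattern?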