data-warehouse

Which is better, ETL or ELT?

Having spent some time working on data warehousing, I have created both ETL (extract transform load) and ELT (extract load transform) processes. It seems that ELT is a newer approach to populating data warehouses that can more easily take advantage of cluster computing resources. I would like to hear what other people think the advantage...

How can I speed up queries against huge data warehouse tables with effective-dated data?

I am querying some extremely large tables. They are so large because PeopleSoft inserts a new record every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse. This necessitates queries that have nested selects in them, to get the most...

What open source DB is best suited to a dimensional data warehouse?

What open source DB is best suited to host a dimensional data warehouse? ...

How do I avoid complex joins in a star schema?

My fact table holds a user's score in a course he took. Some of the course details, which I have to show on the report, come from more than one table (in the actual OLTP db). Do I create a denormalized version of that course entry in a dimension table? Or do I just join the fact table directly to the course table join to the oth...

SQL Server: unique key for batch loads

Hi, I am working on a data warehousing project where several systems are loading data into a staging area for subsequent processing. Each table has a "loadId" column which is a foreign key against the "loads" table, which contains information such as the time of the load, the user account, etc. Currently, the source system calls a sto...

What is Dim, what is Fact?

I have an application that I know would make a great cube and would be useful for more than the standard flat Reporting Services report. We're about to jump into BI stuff with a consultant, but I'd like to give it a shot before we do, mostly so I know something of what we're going to do. The application tracks surveys in nursing homes ...

How do I implement something like Digg Swarm in PHP or Python?

http://labs.digg.com/swarm/ ...

Achieving better DB performance

I have a website backed by a relational database composed of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc.). The stored procedure that returns order history is painfully slow due to the amount of data and the numerous joins that must occur, and depending on the search pa...

Business Requirement Document (Data warehousing)

Can anyone please give me an example of a Business Requirement Document for e-business with respect to data warehousing? ...

Doesn't anyone use a Customer Fact?

I am in the process of designing a new customer fact and dimensions for my warehouse. In my search for good example models I noticed something odd. No one seems to have a customer-centric fact. Every example I found has a transaction event such as a sale or order as the central fact with customers as a dimension. This raises a question f...

What are the open source tools and techniques to build a complete data warehouse platform?

I'm looking for open source tools, possibly free or with a free trial version, to set up a complete data warehouse stack. I know about a few, like Pentaho's open source Mondrian server, but couldn't find any Google result on setting up a complete platform. I'm not sure whether these components are compatible with each other. Could someone please li...

Design of a data warehouse with more than one fact table

Hi, I'm new to data warehousing. First, I want to point out that my copy of The Data Warehouse Toolkit is on its way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net. What I don't find on the net, however, is what to do when you seem to have more than one fact in a DW. In my case (insurr...

What is the difference between a database and a data warehouse?

What is the difference between a database and a data warehouse? Aren't they the same thing, or at least written in the same thing (i.e. Oracle RDBMS)? ...

How do I aggregate data by day and still respect timezone?

We are currently using a summary table that aggregates information for our users on an hourly basis in UTC time. The problem we are having is that this table is becoming too large and slowing our system down immensely. We have applied all the tuning techniques recommended for PostgreSQL and we are still experiencing slowness. Our idea ...

Data caching techniques / Tips / AppFabric

We have millions and millions of records in a SQL table, and we run really complex analytics on that data to generate reports. As the table is growing and additional records are being added, the computation time is increasing and the user has to wait a long time before the webpage loads. We were thinking of using a distributed cache lik...

Data Warehousing Design Question

I'm developing a data warehouse and have come up against a problem I'm not sure how to fix. The current schema is defined below: DimInstructor <- Dimension table for instructors; DimStudent <- Dimension table for students. I want to implement a scenario whereby if details of an instructor change in my OLTP database, I want to add a new r...

ETL - Checking for updated dimension data

I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings. In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form). There are curr...

SQL Server Analysis Services Parent Child with non-unique key

I'm currently building our Data Warehouse, primarily using Ralph Kimball's methods and guidance. We are using the Microsoft stack for this (so SSIS, SSAS). I am a bit stuck deciding how to handle BOMs (Bills of Materials), which are effectively unbalanced hierarchies. The BOM handles assemblies, which are collections of parts. Each part...

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are: Website, Visitor. Each unique visitor will have a single row in the database with information like landing page, time of day, OS, Browser, referrer, IP, etc. I will need to do aggregated queries on this database s...

Do a lot of repeated sum(x) calls in different columns make a SELECT slower?

I have a really big table with dozens of columns and lots of rows. Let's call this table FT. Every day I run a script that reads data from the FT table, performs some calculations, and updates a smaller table (table FA) that I use to generate reports. The query that updates FA is something like: INSERT INTO FA (A, B, C) (SELECT sum(X),...