views:

61

answers:

3

I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.

I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?

Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.

Thanks!

A: 

Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.

A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:

  • Many indexes.

  • No joins, except to look up tables.

  • Duplicated data, the subject table is highly denormalized.

  • Contains derived and aggregated information.

The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.

As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).

Here's a general article to get you started: Developing a Data Warehouse Architecture

Here's a high level overview of the design of a star schema: Designing the Star Schema Database

Gilbert Le Blanc
Thanks Gilbert - can you go into more detail on how the duplicated, derived and/or aggregated data is stored and accessed? I.e. hash tables keyed on the dimensional keys? perhaps containing a collection of aggregates, etc.?
Jeff Meatball Yang
Hi Jeff. Take a look at figure 10 of the star schema link. WalMart, to take a retailer, would have trillions of SalesFact rows. The queries against SalesFact would never end if it wasn't for the summary tables TimeDimension and StoreDimension. The summary tables answer most of the common queries.The actual data storage doesn't matter as much as designing the data warehouse so that most of the queries are answered by the summary tables.Now building and maintaining the summary tables, that takes a batch process running periodically.
Gilbert Le Blanc
@Gilbert Le Blanc Summary facts are not stored in dimensions. Those are facts that are just aggregated at a different grain (and perhaps with different dimensions), but they are still stored in fact tables.
Cade Roux
@Jeff Ok. You tell me what a dimension is.
Gilbert Le Blanc
@Gilbert Le Blanc I have given an example of generating aggregate summary information in my answer.
Cade Roux
+1  A: 

In a star-schema database, facts are usually acquired and stored at the finest grain.

So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

Right now, the grain is Product, Time (at a day granularity), Store.

Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).

Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).

ProductID
TimeID (only linking to DayOfMonth = 1)
StoredID
SalesDollars

And you would get this by doing:

INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
    ,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
    ,sf.StoreID
    ,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
    ON td.TimeID = sf.TimeID
GROUP BY td.Year, td.Month

What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.

Cade Roux
+2  A: 

There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:

http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016

Michael