We have millions and millions of records in a SQL table, and we run really complex analytics on that data to generate reports.

As the table grows and additional records are added, the computation time increases and users have to wait a long time before the webpage loads.

We were thinking of using a distributed cache like AppFabric to load the data into memory when the application starts, and then running our reports off that in-memory data. This should improve response times, since the data would be read from memory instead of disk.
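
Roughly the kind of thing we had in mind, as an illustrative sketch only (Python rather than our actual .NET/AppFabric stack; the database file, table, and report below are made up):

    import sqlite3
    from collections import defaultdict

    # On application start: pull the rows the reports need into an in-memory
    # structure instead of hitting the big table on every page load.
    def load_cache(db_path="sales.db"):          # hypothetical database
        conn = sqlite3.connect(db_path)
        cache = defaultdict(list)
        for region, amount in conn.execute("SELECT region, amount FROM sales"):
            cache[region].append(amount)
        conn.close()
        return cache

    # Reports then run against the in-memory copy, not the database.
    def revenue_report(cache):
        return {region: sum(amounts) for region, amounts in cache.items()}

    cache = load_cache()          # done once at startup (or on a refresh schedule)
    print(revenue_report(cache))  # each page load reads from memory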

Before we take the plunge and implement this, I wanted to find out what others are doing and what the best techniques and practices are for loading data into memory, caching, etc. Surely you don't just load an entire table with hundreds of millions of records into memory...?

I was also looking into OLAP / data warehousing, which might give us better performance than caching.

A: 

We have millions and millions of records in a SQL table,

Bad policy. Flat files are better.

and we run really complex analytics on that data to generate reports.

In some cases, you'd be happier loading relevant subsets into SQL.

As the table is growing and additional records are being added, the computation time is increasing

That's the consequence of using a database for too much. Use it for less.

We were thinking of using a distributed cache like AppFabric...

Perhaps. Flat files, however, are faster and more scalable than an RDBMS for this kind of workload.

I was also looking into OLAP / data warehousing

Good plan. Buy Kimball's book immediately. You don't need more technology. You only need to make better use of flat files as the primary store and SQL as a place for users to run ad-hoc queries against subsets.
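
A rough sketch of that shape of workflow (Python with the standard library; the flat-file layout, column names, and query are invented for illustration):

    import csv
    import sqlite3

    # Bulk data lives in a plain flat file; appending new records is just a file append.
    # Take one slice of it and load only that subset into SQL for ad-hoc analysis.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")

    with open("orders.csv", newline="") as f:   # hypothetical flat file: region,product,amount
        subset = ((r, p, float(a)) for r, p, a in csv.reader(f) if r == "EMEA")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", subset)

    # Users run ad-hoc SQL against the small subset, not the full history.
    for product, total in conn.execute(
            "SELECT product, SUM(amount) FROM orders GROUP BY product"):
        print(product, total)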

S.Lott
I feel that "Bad policy. Flat files are better." is a bit of a sweeping statement. Could you elaborate? It's quite common to see hundreds of millions of rows in an RDBMS table and have it perform well: indexes, partitioning, indexed views spring to mind...
Mitch Wheat
@Mitch Wheat. Flat files are simpler (and therefore faster) than any RDBMS. If the data is simply accumulated and then analyzed, that's what flat files are best for. Please buy Kimball's book. DW is done most simply with flat files for bulk and SQL for ad-hoc queries.
S.Lott
@S.Lott: I already have Kimball's book. Several in fact...
Mitch Wheat
@S.Lott: Could you tell me where Kimball makes the statement that flat files are always faster? Thx.
Mitch Wheat
@Mitch Wheat: Kimball suggests that the "dimension bus" should be flat files. Flat files are faster than an RDBMS for trivial append and retrieve operations because they're simpler. Do some measurements of file system read and write compared with RDBMS Insert/Select. RDBMS == File System + Overheads for locking.
S.Lott
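
For what it's worth, a measurement along those lines is easy to sketch (Python standard library only; SQLite stands in for the RDBMS, and the absolute numbers will vary wildly with hardware, schema, and row size):

    import sqlite3, time

    rows = [("id-%d" % i, i * 1.5) for i in range(1_000_000)]

    # Flat-file append + sequential read
    t0 = time.time()
    with open("data.txt", "w") as f:
        f.writelines("%s,%s\n" % r for r in rows)
    with open("data.txt") as f:
        total = sum(float(line.split(",")[1]) for line in f)
    flat = time.time() - t0

    # RDBMS insert + select over the same data
    t0 = time.time()
    conn = sqlite3.connect("data.db")
    conn.execute("DROP TABLE IF EXISTS t")
    conn.execute("CREATE TABLE t (k TEXT, v REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    conn.commit()
    total = conn.execute("SELECT SUM(v) FROM t").fetchone()[0]
    db = time.time() - t0

    print("flat file: %.2fs  rdbms: %.2fs" % (flat, db))
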
Which Kimball book?
ace
@ace: Any of them. Start with The Data Warehouse Toolkit.
S.Lott
Flat files are definitely *not* better when you're slicing and dicing, e.g. when your query needs to aggregate only a subset of the rows. As the number of different slices required increases, the attractiveness of flat files decreases. Also, they lack good tools to ensure data consistency, availability, etc.
shmichael
@shmichael: Correct. Fine-grained, ad-hoc slicing and dicing is better done in SQL. That's the "datamart" approach. Take a major slice through the flat-file and load the relevant subset into a database for final analysis. I'm not sure what you're talking about with "data consistency" or "availability" with flat files -- the OS handles this. It's trivial. RDBMS's make this complex; the OS already does this.
S.Lott
@S.Lott: Today I tested our database vs. flat files, and you are right on... there is no comparison... a SQL query that was taking about 8 minutes to load up my objects takes about 15 seconds when reading from a serialized file on disk. I was blown away.
ace
@ace: Databases add overhead for flexibility and concurrency. In most analytic situations you need neither flexibility nor concurrent writing. For analytics a database is rarely relevant. Look into Hadoop. http://hadoop.apache.org/
S.Lott
+2  A: 

The solution to complex reporting is to pre-calculate, so you're on the right path if you're looking at OLAP.
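
In its simplest form that can just mean maintaining summary tables that the reports read directly, something like this sketch (Python with SQLite; the fact table and summary table here are hypothetical):

    import sqlite3

    conn = sqlite3.connect("warehouse.db")   # hypothetical database with a raw "sales" table

    # Run this on a schedule (e.g. nightly), not on every page load:
    # roll the detail rows up once into a small summary table.
    conn.executescript("""
        DROP TABLE IF EXISTS daily_sales_summary;
        CREATE TABLE daily_sales_summary AS
            SELECT sale_date, region, SUM(amount) AS revenue, COUNT(*) AS orders
            FROM sales
            GROUP BY sale_date, region;
    """)

    # The report then queries a few thousand pre-aggregated rows instead of millions.
    rows = conn.execute(
        "SELECT region, SUM(revenue) FROM daily_sales_summary GROUP BY region").fetchall()
    print(rows)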

Alex