views: 35

answers: 3

Hi,

Please excuse my poor English in advance; it is not my mother tongue.

We have a national application, and the users would like accurate business statistics on some of its tables.

We are using Tomcat, Spring WS, and Hibernate on top of that.

We have thought about several solutions:

  1. Plain old queries for each user request.
    The problem is that those tables contain millions of records, so every query takes many seconds at least. This solution was never used.

  2. The current solution: triggers.
    But they are painful to create and difficult to maintain (no OO, no nice IDE, no real debugging). The only helpful part is the possibility to write JUnit tests at a higher level to verify the expected results. And for each different statistic on a table, we have to create another trigger on that table (a minimal trigger sketch follows this list).

  3. Using the Quartz framework to consolidate the data every X minutes.
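
To make option 2 concrete, here is a minimal sketch of the kind of trigger we maintain. The table and column names (phones, phone_stats, employee_id) are only illustrative; our real triggers are more involved.

    -- Sketch only: phones, phone_stats and employee_id are illustrative names.
    -- The summary table keeps a running phone count per employee so that
    -- statistics never have to scan the big phones table.
    CREATE TABLE phone_stats (
        employee_id bigint PRIMARY KEY,
        phone_count integer NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION maintain_phone_stats() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE phone_stats SET phone_count = phone_count + 1
             WHERE employee_id = NEW.employee_id;
            IF NOT FOUND THEN
                INSERT INTO phone_stats (employee_id, phone_count)
                VALUES (NEW.employee_id, 1);
            END IF;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE phone_stats SET phone_count = phone_count - 1
             WHERE employee_id = OLD.employee_id;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER phones_stats_trg
    AFTER INSERT OR DELETE ON phones
    FOR EACH ROW EXECUTE PROCEDURE maintain_phone_stats();

Multiply this by every statistic and every table, and you can see why we find it hard to maintain.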

I have learned that databases are NOT DESIGNED for these heavy and complicated queries.

A separate data warehouse optimized for read-only queries would be better (OLAP?). But I don't have any clue where to start with PostgreSQL. (Is Pentaho the solution, or just a part of it?)

  1. How could we extract data from the production database? Using some extractor (see the sketch after these questions)?
  2. And when? Every night?
  3. If it is periodic, how will we manage to maintain near-real-time statistics if the data is only dumped into our data warehouse once per day?
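
For the extraction part, the simplest thing I can imagine is a plain bulk export with PostgreSQL's COPY. This is only a sketch with assumed paths, table names and an assumed updated_at column; a dedicated ETL tool (Pentaho Data Integration?) would probably replace it.

    -- Sketch only: paths, table names and the updated_at column are assumptions.
    -- Export yesterday's changes from the production database...
    COPY (SELECT * FROM phones WHERE updated_at >= current_date - 1)
        TO '/var/exports/phones_delta.csv' WITH CSV HEADER;

    -- ...and load them into a staging table on the warehouse side.
    COPY staging_phones FROM '/var/exports/phones_delta.csv' WITH CSV HEADER;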

Thanks in advance for your answers; I really appreciate it.

+1  A: 

"I have learn that databases are NOT DESIGNED for these heavy and complicated queries." Well you need to unlearn that. A database was designed for just these type of queries. I would blame bad design of the software you are using before I would blame the core technology.

StarShip3000
Bad design, the mother of all performance problems! And programmers who think they can outsmart a database ;)
Frank Heikens
A: 

It seems I have been misunderstood.

For those who think that a classic database is designed even for processing real-time statistics with queries over billions of rows: they might need to read articles on the origin of OLAP, and ask why people bother to design products around it if performance were just a design question.

"I would blame bad design of the software you are using before I would blame the core technology." By the way, im not using any software (or pgadmin counts ?). I have two basic tables, you cant make it more simple,and the problem comes when you have billions datas to retreve for statistics.

For those who think it is just a design problem, I'm glad to hear their clever answer (not a trigger, I know that one) to a simple problem: imagine you have 2 tables, employees and phones. An employee may have 0 to N phones. Now let's say you have 10,000,000 employees and 30,000,000 phones.

Your end users want to know in real time:
1. the average number of phones per employee
2. the average age of employees who have more than 3 phones
3. the average number of phones for employees who have been with the company for more than 10 years

You potentially have 100 users who want those real-time statistics at any time.

Of course, no query should take more than 1/4 of a second.
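
To show what I mean, here is what those three statistics look like as plain aggregates over the raw tables (a sketch only; birth_date, hire_date and employee_id are assumed column names):

    -- Sketch only: birth_date, hire_date and employee_id are assumed columns.
    -- 1. Average number of phones per employee
    SELECT count(p.id)::numeric / count(DISTINCT e.id) AS avg_phones
    FROM employees e
    LEFT JOIN phones p ON p.employee_id = e.id;

    -- 2. Average age of employees who have more than 3 phones
    SELECT avg(date_part('year', age(e.birth_date))) AS avg_age
    FROM employees e
    WHERE (SELECT count(*) FROM phones p WHERE p.employee_id = e.id) > 3;

    -- 3. Average number of phones for employees with more than 10 years of service
    SELECT count(p.id)::numeric / count(DISTINCT e.id) AS avg_phones
    FROM employees e
    LEFT JOIN phones p ON p.employee_id = e.id
    WHERE e.hire_date < current_date - interval '10 years';

Over 10,000,000 employees and 30,000,000 phones, each of these is a full scan and join, which is exactly what will not fit in a 1/4-second budget for 100 concurrent users.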

mada
A: 

Incrementally summarize the data? The frequency depends on your requirements, and in extreme cases you may need more hardware, but that is very unlikely. The steps look roughly like this (a minimal sketch follows the list):

  1. Bulk load new data
  2. Calculate new status [delta] using new data and existing status
  3. Merge/update status
  4. Insert new data into permanent table (if necessary)
  5. NOTIFY wegotsnewdata
  6. Commit
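
A minimal sketch of those steps in PostgreSQL, assuming a staging table phones_new for the bulk-loaded delta and a small summary table phone_summary (all names and the file path are assumptions):

    -- Sketch only: phones_new, phone_summary and the file path are assumptions.
    BEGIN;

    -- 1. Bulk load new data into a staging table
    COPY phones_new (id, employee_id, created_at)
        FROM '/var/imports/phones_delta.csv' WITH CSV;

    -- 2 + 3. Compute the delta from the new rows and merge it into the summary
    UPDATE phone_summary
       SET phone_count = phone_count + (SELECT count(*) FROM phones_new);

    -- 4. Insert the new rows into the permanent table, then empty the staging table
    INSERT INTO phones (id, employee_id, created_at)
    SELECT id, employee_id, created_at FROM phones_new;
    TRUNCATE phones_new;

    -- 5. Tell listening clients (e.g. the statistics screens) that fresh numbers
    --    exist; the notification is delivered when the transaction commits
    NOTIFY wegotsnewdata;

    -- 6. Commit
    COMMIT;

The readers then query only the small summary table, so the statistics stay near real time without ever scanning the big tables.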

StarShip3000 is correct, btw.

jwp