I'm looking for open source tools, ideally free or with a free trial version, to set up a complete data warehouse stack.

I know about a few, such as Pentaho's open source Mondrian server, but I couldn't find anything on Google about setting up a complete platform. I'm also not sure whether these components are compatible with each other. Could someone please list them along with their position in the chain?

Thanks.

A: 

A data warehouse stack (or suite) usually consists of three layers, commonly referred to as ETL (loading), Database, and Reporting (interface). In addition, there are more advanced tools for performance and expert needs: cubes and statistical analysis tools.

As far as interoperability goes, the ETL tools and the reporting tools need to support whatever database you are using. However, since there are only two big open source databases, there is usually no problem mixing different solutions.

As for specifics -

1 - ETL

Data loading can be done with open-source tools such as Pentaho Data Integration or Talend (an Eclipse extension). I would suggest googling "open source etl" to tailor the solution to your specific needs.
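To make the ETL layer concrete, here is a minimal sketch of what such a tool does, written with only the Python standard library. It is not how Pentaho Data Integration or Talend work internally; the CSV content, the `fact_sales` table, and the function names are all made up for illustration, and SQLite stands in for whichever database you pick.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (inline so the sketch is self-contained).
RAW_CSV = """order_id,region,amount
1, north ,10.50
2,south,20.00
3,north,
4,South,5.25
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: skip rows without an amount, normalize region names, cast types."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record, not loaded
        clean.append((int(row["order_id"]), row["region"].strip().lower(), float(amount)))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into a (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]

# SQLite in memory is only a stand-in for PostgreSQL or MySQL.
conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(RAW_CSV)))
print("rows loaded:", loaded)
```

A real ETL tool adds scheduling, connectors, and error handling on top of this, which is the main reason to use one instead of hand-written scripts.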

2 - DB

You'll need a relational database (RDBMS). The two most prominent open-source players are PostgreSQL and MySQL. While MySQL has a larger user base, Postgres has been gaining more and more popularity since it implemented several crucial features that were missing in earlier versions.

3 - Reporting

Pentaho offers a reporting platform, and so does BIRT (another Eclipse extension). Again, Google is your friend for specific comparisons. Note that if you choose Pentaho for both the ETL and reporting tools, you are likely to get better integration. You've also mentioned Mondrian, which is an OLAP server that runs MDX queries over an RDBMS. MDX is the standard language for querying cubes.

At this point, assuming you are starting from scratch, I would recommend setting up the first two layers of the data warehouse, ETL and DB. You can later add any number of reporting tools on top.
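As a rough illustration of what the reporting and cube layer boils down to, the sketch below runs a grouped aggregate over a small fact table. It uses plain SQL through Python's built-in `sqlite3` module, not MDX and not any Pentaho, BIRT, or Mondrian API; the `fact_sales` table and its rows are invented sample data.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("north", 2009, 10.0),
        ("north", 2010, 15.0),
        ("north", 2010, 2.5),
        ("south", 2009, 20.0),
        ("south", 2010, 5.0),
    ],
)

def sales_report(conn):
    """Total sales per (region, year): two dimensions, one measure."""
    return conn.execute(
        "SELECT region, year, SUM(amount) "
        "FROM fact_sales GROUP BY region, year ORDER BY region, year"
    ).fetchall()

report = sales_report(conn)
for region, year, total in report:
    print(f"{region:<6} {year} {total:8.2f}")
```

A reporting tool issues queries of this kind for you and handles the formatting; a cube layer lets users ask the same question in terms of dimensions and measures instead of writing SQL.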

shmichael
A: 

Here is another, similar question: http://stackoverflow.com/questions/354231/20-billion-rows-month-hbase-hive-greenplum-what

The most relevant part:

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.

.

Hive or HBase put you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.

Sandeep
+5  A: 

The Open Source Data Warehousing presentation does a great job of identifying OSS components that could be used to build a data warehouse stack: Infrastructure (servers, OS, databases), Integration Management (ETL, EAI, etc.), Information Management (DW/Mart/ODS, OLAP servers, etc.), and Information Delivery (portal, dashboard, analytics/OLAP client, etc.). Here is a summary:

Open Source BI/DW Projects

BI and Analytics

Databases

Integration

I recommend browsing the presentation. Good stuff.

Pascal Thivent