I have an application that requires analytics at different levels of aggregation — an OLAP workload. I also want to update my database fairly frequently.

For example, here is what my updates look like (the schema is: time, dest, source ip, browser -> visits):

(15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) -->  105

(15:00-1-2-2010, www.stackoverflow.com, 128.19.2.1, firefox) --> 110

...

(15:00-1-5-2010, www.cnn.com, 128.19.5.1, firefox) --> 110

I then want to ask questions like: what was the total number of visits to www.stackoverflow.com from the Firefox browser last month?
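
To make the target concrete, here is a toy Python sketch of the kind of pre-aggregation I have in mind (data and timestamp format are illustrative only; I use ISO "YYYY-MM-DD HH:MM" timestamps so the month is a prefix):

    from collections import defaultdict

    # Toy facts in the schema above: (time, dest, source ip, browser) -> visits.
    raw = [
        (("2010-02-01 15:00", "www.stackoverflow.com", "128.19.1.1", "safari"), 105),
        (("2010-02-01 15:00", "www.stackoverflow.com", "128.19.2.1", "firefox"), 110),
        (("2010-05-01 15:00", "www.cnn.com", "128.19.5.1", "firefox"), 110),
    ]

    # Roll up to the (month, dest, browser) level, dropping source ip.
    monthly = defaultdict(int)
    for (ts, dest, _ip, browser), visits in raw:
        monthly[(ts[:7], dest, browser)] += visits

    # The monthly question becomes a single lookup instead of a scan:
    print(monthly[("2010-02", "www.stackoverflow.com", "firefox")])  # -> 110

The point is to maintain such coarser aggregates incrementally as updates arrive, so queries stay cheap.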

I understand that a system like Vertica can do this relatively cheaply (performance- and scalability-wise, though probably not cost-wise). I have two questions:

1) Is there an open-source product I can build on to solve this problem? In particular, how well does a Mondrian-based system work in terms of scalability and performance?

2) Is there an HBase- or Hypertable-based solution for this (obviously, a bare HBase/Hypertable can't do it by itself)? If there is a project built on HBase/Hypertable, scalability probably won't be an issue, IMO.

Thanks!

+2  A: 

You can download a free edition (the single-node edition) of the Greenplum database. I haven't tried it myself, but I gather it is a powerful beast. Read here: http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/

Another option is MongoDB: it is fast and free, and you can write MapReduce functions in JavaScript to do the analytics.
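
An untested sketch of what that might look like, driven from Python with pymongo (the "analytics" database, "visits" collection, and field names are made up, with one document per fact row; note that MongoDB has since deprecated mapReduce in favor of the aggregation pipeline):

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["analytics"]  # hypothetical database of fact documents

    # Map every Firefox fact to a (dest, month) key; reduce by summing visits.
    map_js = Code("""
        function () {
            if (this.browser == "firefox")
                emit({ dest: this.dest, month: this.time.substr(0, 7) },
                     this.visits);
        }""")
    reduce_js = Code("""
        function (key, values) {
            return Array.sum(values);
        }""")

    out = db.command("mapReduce", "visits",
                     map=map_js, reduce=reduce_js, out={"inline": 1})
    for row in out["results"]:
        print(row["_id"], row["value"])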

My reputation here is too low to add a hyperlink to MongoDB, so you will have to Google it; I can only add one hyperlink per post.

AABBCCDD
Greenplum is not free.
charlie111
The single node edition is free.
AABBCCDD
+1  A: 

The zohmg project aims to solve this problem using Hadoop and HBase.

Michael Greene
A: 

What is the status of the "zohmg project"?

Any production deployment?

Thanks

charlie111
+1  A: 

Facebook also built Hive on top of Hadoop. It is pretty simple to get going, and it has a reasonable query API too.

http://mirror.facebook.net/facebook/hive/
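
For the example in the question, the query might look roughly like this (untested; the table and column names are made up), driven from Python via the Hive CLI:

    import subprocess

    # Hypothetical table: visits(ts STRING, dest STRING, source_ip STRING,
    #                             browser STRING, visit_count BIGINT)
    query = """
        SELECT SUM(visit_count)
        FROM visits
        WHERE dest = 'www.stackoverflow.com'
          AND browser = 'firefox'
          AND ts LIKE '2010-02%'
    """

    # hive -e runs a query string; each query compiles to batch MapReduce
    # jobs, so expect seconds-to-minutes of latency rather than milliseconds.
    print(subprocess.check_output(["hive", "-e", query]))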

stephbu
A: 

Is your data model more complex than that? If it isn't, you might be better off just writing custom code for it; then you can really tune it to your data. Off-the-shelf products have to offer a lot of flexibility, need a lot of complexity to achieve that, and suffer in speed as a result.

Your question is unclear in one respect: when you say scalable, what do you mean? Are you collecting data from lots of sites but serving only a limited number of query users, or do you also have a lot of users? The latter situation leads to a significantly different design.

Stephan Eggermont
A: 

@Stephan Eggermont

I don't see how the data model matters here. The original post is looking for a pre-computation solution (cuboid lattices) on top of HBase and the like.

You are talking about MapReduce, which underlies what Hive is doing, but that is essentially batch-mode processing.
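
To make "cuboid lattices" concrete, here is a toy Python sketch that materializes every cuboid over three dimensions (a real system would maintain these incrementally in HBase; the names and data are illustrative only):

    from collections import defaultdict
    from itertools import combinations

    DIMS = ("month", "dest", "browser")  # source ip omitted for brevity

    def build_lattice(facts):
        """Materialize one aggregate table per subset of DIMS (2^3 = 8 cuboids)."""
        lattice = {}
        for r in range(len(DIMS) + 1):
            for group in combinations(DIMS, r):
                table = defaultdict(int)
                for dims, visits in facts:
                    table[tuple(dims[d] for d in group)] += visits
                lattice[group] = dict(table)
        return lattice

    facts = [
        ({"month": "2010-02", "dest": "www.stackoverflow.com",
          "browser": "safari"}, 105),
        ({"month": "2010-02", "dest": "www.stackoverflow.com",
          "browser": "firefox"}, 110),
        ({"month": "2010-05", "dest": "www.cnn.com",
          "browser": "firefox"}, 110),
    ]
    lattice = build_lattice(facts)

    # The question's query hits the fully grouped cuboid directly:
    print(lattice[("month", "dest", "browser")]
                 [("2010-02", "www.stackoverflow.com", "firefox")])  # -> 110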

charlie111
@Charlie111: please edit your question instead of adding an answer.
John Saunders