views: 4159

answers: 8

Hi,

I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to help you understand the problem:

Data is organized in a star schema with one BIG fact table and ~15 dimensions.
20B fact rows per month
10 dimensions with a few hundred rows each (somewhat hierarchical)
5 dimensions with thousands of rows
2 dimensions with ~200K rows
2 big dimensions with 50M-100M rows

Two typical queries run against this DB

Top members in dimq:
select top X dimq, count(id)
from fact
where dim1 = x and dim2 = y and dim3 = z
group by dimq
order by count(id) desc

Measures against a tuple:
select count(distinct dis1), count(distinct dis2), count(dim1), count(dim2),...
from fact
where dim1 = x and dim2 = y and dim3 = z

Questions:
1. What is the best platform to perform such queries?
2. What kind of hardware is needed?
3. Where can it be hosted (EC2?)


(please ignore importing and loading issues at the moment)

Tnx,
Haggai.

+19  A: 

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.

20 billion rows per month puts you in VLDB territory, so you need partitioning. The low-cardinality dimensions also suggest that bitmap indexes would be a performance win.
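As a rough sketch of what that might look like (Oracle syntax; the partitioning column and all names here are illustrative, not taken from your schema):

-- Hypothetical layout: monthly range partitions plus bitmap indexes on
-- the low-cardinality dimension keys. Names are illustrative only.
CREATE TABLE fact (
    id         NUMBER,
    load_date  DATE,        -- assumed partitioning column
    dim1       NUMBER,
    dim2       NUMBER,
    dim3       NUMBER,
    dimq       NUMBER
    -- ... remaining dimension keys and measures
)
PARTITION BY RANGE (load_date) (
    PARTITION p200901 VALUES LESS THAN (DATE '2009-02-01'),
    PARTITION p200902 VALUES LESS THAN (DATE '2009-03-01')
    -- one partition per month
);

-- Bitmap indexes on a partitioned fact must be LOCAL; they pay off on
-- the small dimensions (hundreds of rows each).
CREATE BITMAP INDEX fact_dim1_bix ON fact (dim1) LOCAL;
CREATE BITMAP INDEX fact_dim2_bix ON fact (dim2) LOCAL;
CREATE BITMAP INDEX fact_dim3_bix ON fact (dim3) LOCAL;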

  • Forget the cloud systems (Hive, Hbase) until they have mature SQL support. For a data warehouse application you want something that works with conventional reporting tools. Otherwise, you will find yourself perpetually bogged down writing and maintaining ad-hoc report programs.

  • The data volumes are manageable with a more conventional DBMS like Oracle - I know of a major European telco that loads 600 billion rows per day into an Oracle database. All other things being equal, that's two orders of magnitude bigger than your data volumes, so shared disk architectures still have headroom for you. A shared-nothing architecture like Netezza or Teradata will probably be faster still but these volumes are not at a level that is beyond a conventional shared-disk system. Bear in mind, though, that these systems are all quite expensive.

  • Also bear in mind that MapReduce is not an efficient query selection algorithm. It is fundamentally a mechanism for distributing brute-force computations. Greenplum does have a MapReduce back-end, but a purpose-built shared nothing engine will be a lot more efficient and get more work done for less hardware.

My take on this is that Teradata or Netezza would probably be the ideal tool for the job, but definitely the most expensive. Oracle, Sybase IQ or even SQL Server would also handle the data volumes involved, but will be slower - they are shared-disk architectures, but they can still manage this sort of data volume. See this posting for a rundown on VLDB-related features in Oracle and SQL Server, and bear in mind that Oracle has just introduced the Exadata storage platform as well.

My back-of-a-fag-packet capacity plan suggests maybe 3-5TB or so per month including indexes for Oracle or SQL Server. Probably less on Oracle with bitmap indexes, although an index leaf has a 16-byte ROWID on Oracle vs. a 6-byte page reference on SQL Server.
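Working backwards from those numbers (purely to sanity-check the estimate; the per-row figures are implied, not given):

% 20 billion rows/month at 3-5 TB/month implies roughly:
\[
\frac{3\ \text{TB}}{20\times10^{9}\ \text{rows}} \approx 150\ \text{bytes/row},
\qquad
\frac{5\ \text{TB}}{20\times10^{9}\ \text{rows}} \approx 250\ \text{bytes/row}
\]

i.e. roughly 150-250 bytes per fact row including index overhead.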

Sybase IQ makes extensive use of bitmap indexes and is optimised for data warehouse queries. Although a shared-disk architecture, it is very efficient for this type of query (IIRC it was the original column-oriented architecture). This would probably be better than Oracle or SQL Server as it is specialised for this type of work.

Greenplum might be a cheaper option but I've never actually used it so I can't comment on how well it works in practice.

If you have 10 dimensions with just a few hundred rows each, consider merging them into a single junk dimension, which will slim down your fact table by merging the ten keys into one. You can still implement hierarchies on a junk dimension, and this could knock half or more off the size of your fact table and eliminate a lot of the disk used by indexes.
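A minimal sketch of the junk-dimension idea (all names here are made up for illustration):

-- One row per observed combination of the ten small dimensions,
-- replacing ten foreign keys in the fact table with a single key.
CREATE TABLE junk_dim (
    junk_key   NUMBER PRIMARY KEY,
    dim_a_code VARCHAR2(20),
    dim_b_code VARCHAR2(20)
    -- ... attributes of the other eight small dimensions
);

-- Queries then filter and group through junk_dim instead of the
-- individual small dimensions, along these lines:
SELECT j.dim_a_code, COUNT(f.id)
FROM   fact f
JOIN   junk_dim j ON j.junk_key = f.junk_key
WHERE  j.dim_b_code = 'x'
GROUP  BY j.dim_a_code
ORDER  BY COUNT(f.id) DESC;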

I strongly recommend that you go with something that plays nicely with a reasonable cross-section of reporting tools. This means a SQL front end. Commercial systems like Crystal Reports allow reporting and analytics to be done by people with a more readily obtainable set of SQL skills. The open-source world has also produced BIRT, Jasper Reports and Pentaho. Hive or HBase put you in the business of building a custom front end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.

Finally, host it somewhere you can easily get a fast data feed from your production systems. This probably means your own hardware in your own data centre. This system will be I/O bound; it's doing simple processing on large volumes of data. This means you will need machines with fast disk subsystems. Cloud providers tend not to support this type of hardware, as it's an order of magnitude more expensive than the type of disposable 1U box traditionally used by these outfits. Fast disk I/O is not a strength of cloud architectures.
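To put a rough number on "I/O bound" (the 1 GB/s sequential read rate is just an assumed figure for a decent disk subsystem):

% Brute-force scan of one month at the ~5 TB/month estimate above:
\[
\frac{5\ \text{TB}}{1\ \text{GB/s}} = 5000\ \text{s} \approx 83\ \text{minutes per full scan of a single month}
\]

which is why partition pruning, bitmap indexes and/or a column store matter so much here.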

ConcernedOfTunbridgeWells
SQL Server should also be able to handle this and has its own reporting services as well as support for Crystal Reports.
HLGEM
Yes, people certainly do multi-terabyte SQL Server data warehouse systems - I'd say it should be able to cope with 20 billion rows/month.
ConcernedOfTunbridgeWells
With 15 dimensions? I still remember the 3-dimensional 1 TByte demo, which made MS the laughing stock of the industry.
Stephan Eggermont
The biggest SQL Server warehouse system I'm aware of is around 60TB at a UK building society. It's not as good for VLDB work as Oracle and certainly nowhere in the same league as Teradata but you can do it.
ConcernedOfTunbridgeWells
With 4 years of data this is over 26 TB of primary data, so this is larger. I'm not sure you should want to be the largest site of any product.
Stephan Eggermont
Don't neglect to investigate Oracle's new Exadata platform either.
David Aldridge
Oracle might be able to handle it, if you can afford the DBAs. My experience is that it requires endless tuning and handholding. Greenplum works nicely, but was sorely lacking in the concurrency department last time I looked (a couple of years ago). I hear they've improved on that since.
SquareCog
A: 

How many users and what response times do you need?

Are you focussing here on the lone specialist with a blade rack and his monthly reports or do you want to give real-time worldwide access to thousands of end-users?

19 dimensions is a lot for materializing sub-cubes.
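For scale (standard cube-lattice counting, nothing specific to this schema):

% One candidate sub-cube per subset of the 19 dimensions:
\[
2^{19} = 524{,}288\ \text{possible group-by combinations}
\]

and that is before counting the levels of any hierarchies.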

Stephan Eggermont
A: 

Thanks for the answers.

Right now we are using SQL Server.
The main problems are cost (powerful machines and SAN storage, not to mention licensing) and query performance (which is really another cost issue).

As I see it, here are the main anchors of the architecture:
1. Scalability = shared nothing (forget SQL Server, Oracle and the other oldies).
2. Column-oriented DB (such as HBase, Sybase IQ, Vertica), since queries return a small number of results over a small number of columns but scan a huge number of rows (rough sketch after this list).
3. Hosting: I'd rather use a hosted cloud than my own hardware, for ease of scaling.
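To illustrate point 2 with the Top-N query from my original post - roughly what I have in mind, written in something like Vertica's projection syntax (I haven't used it yet, so treat this as a sketch; only the column names come from the real schema):

-- A projection that stores just the columns the Top-N query touches,
-- sorted so the WHERE clause becomes a narrow range scan.
CREATE PROJECTION fact_topn_p AS
SELECT dim1, dim2, dim3, dimq, id
FROM   fact
ORDER  BY dim1, dim2, dim3, dimq
SEGMENTED BY HASH(id) ALL NODES;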

I'd be happy to get your idea about the above.

A few more details: Users - not too many, just a few concurrent. Performance - 80% of the reports should return in less than a second. The others (especially those hitting the big dimensions) shouldn't run for more than 10 seconds.

Tnx, Haggai.

btw - Custom reporting is not an issue. That's what we do.

I haven't used it personally, but Vertica on EC2 has been highly recommended to me by an organization with nearly identical needs.
Jason Watkins
A column oriented DB (MonetDB or Vertica) could be interesting.
Stephan Eggermont
A: 

An alternative for a low number of users would be a (Beowulf) cluster. $20K buys you 50 nettops with 500 GB each. That's about 3 kW peak power - or 4 months of cloud storage.
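Roughly how that breaks down (per-box price and power draw are ballpark assumptions):

% Assuming ~$400 and ~60 W per nettop:
\[
50 \times \$400 \approx \$20\text{K}, \qquad
50 \times 500\ \text{GB} = 25\ \text{TB (raw)}, \qquad
50 \times 60\ \text{W} = 3\ \text{kW}
\]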

Stephan Eggermont
A: 

NXC, are you sure about those 600 billion rows per day? Even if a row were just one byte, that's 600 GB of data daily. Assuming a more reasonable 100 bytes per row, we're talking about 60 TB of data per day and 1.8 PB per month. I really doubt anybody is pumping that much data through Oracle.

Other sources seem to confirm that Oracle becomes quite difficult to handle once the data volume reaches two-digit TB figures.

That's what I was told by someone close to the source, but it could possibly have lost something in translation - I suppose it could be 600 million rows/day or 600 GB/day, which is much more plausible. They do something funky with transportable tablespaces to shovel data around the various systems.
ConcernedOfTunbridgeWells
Bear in mind that this outfit has a BI team with 800 people working in it for just the fixed line division, and another one that's not much smaller on the other side of town that does the mobile division.
ConcernedOfTunbridgeWells
I'm not sure that large head-counts at national telcos are indicative of large amounts of work happening!
Will Dean
No, but they are indicative of large budgets.
ConcernedOfTunbridgeWells
+1  A: 

My company has a solution that takes a radically different approach from traditional warehousing and map/reduce. We turn the model upside down by starting with the required Answer Set and then working backwards to figure out the best way to load and store the data to maximize throughput and query speed.

We call it the ProcessOnce Toolkit. We take massive datasets and put them through a two-stage process: the first stage normalizes and compresses the data into what we call our Staging File. At that point you have a queryable flat table with all of the dimensions (columns) that you would ever want to analyze.

The second stage is where the real magic happens, we use proprietary, sophisticated mathematical algorithms to pre-compute ALL of the combinations of Metrics and Dimensions (including distinct counts) as we build the file we call the Answer Set. This super-compressed Answer Set file is ODBC compatible, so you can point any front end tool you want at it. Since all the heavy lifting is now complete, queries run at "the speed of thought" because all they are really doing is looking up answers, not scanning huge tables.

With this methodology and toolkit, we can take multi-billion-row tables and completely process them (from raw data to pre-computed Answer Set) in anywhere from minutes to a few hours, all running on commodity Linux servers.

In another post, someone asked how many users you have. With our Toolkit, it doesn't matter how many because you can think of the Answer Set file mentioned above like a PDF of the data. Just like a PDF, the file can be copied to any location or served across the Internet. Our client-side ODBC driver is the equivalent of the Adobe Acrobat Reader. So you can have as many users as you want.

You mentioned that you have hierarchies in your data. We fully support hierarchies and have a feature we call KeyBack, which lets you drill down to the detailed record level behind any query. Essentially, the Answer Set talks back to the Staging File to get the detailed records only when needed.

For your application, our Toolkit would handle daily processing in approximately 2 hours, running on a $3000 Linux server, with a Toolkit license cost of $5000/month. You can serve the data to users from any old web server you want. The license covers unlimited data volume under that 4-processor server configuration and an unlimited number of users accessing the data.

The company is called BigEye Analytics, we've developed and have been using this Toolkit internally for many years servicing our consulting clients in the database marketing arena. In addition, a limited number of large telecom companies have been using the Toolkit for network analysis. We have a placeholder up at www.bigeyeanalytics.com as we get ready to launch the Toolkit commercially in the next month.

You can catch me on Skype if you'd like more details, davegoulden1.

+1  A: 

Read Monash's site: http://www.dbms2.com/ - he writes about big databases.

Maybe you can use Oracle Exadata (http://www.oracle.com/solutions/business_intelligence/exadata.html and http://kevinclosson.wordpress.com/exadata-posts/) or maybe you can use Hadoop. Hadoop is free.

tuinstoel
+4  A: 

I have had great success with Vertica. I am currently loading anywhere between 200 million and 1 billion rows a day - averaging about 9 billion rows a month - though I have gone as high as 17 billion in a month. I have close to 21 dimensions and the queries run blazingly fast. We moved on from the older system when we simply no longer had the time windows to do the data load.

We did a very exhaustive trial and study of different solutions - we looked at practically everything on the market. While both Teradata and Netezza would have suited us, they were simply too expensive for us. Vertica beat them both on price/performance. It is, by the way, a columnar database.

We have about 80 users now - and it is expected to grow to about 900 by the end of next year when we start rolling out completely.

We are extensively using ASP.NET/Dundas/Reporting Services for reports. It also plays nicely with third-party reporting solutions, though we haven't tried that.

By the way, what are you going to use for data load? We are using Informatica and have been very pleased with it. SSIS drove us up the wall.

no_one