Database solution for 200million writes/day, monthly summarization queries | ansaurus

tags:

views:

126

answers:

3

+3 Q:

Database solution for 200million writes/day, monthly summarization queries

Hello.

I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)

I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea).

I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).

Does anyone have any suggestions as to which databases would be a good fit?

P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?

A:

Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).

jbellis 2010-05-01 15:32:38

A:

Greenplum or Teradata will be a good option. These databases are MPP and can handle peta-scale data. Greenplum is a distributed PostgreSQL db and also has it own mapreduce. While Hadoop may solve your storage problem but it wouldn't be helpful for performing summary queries on your data.

Harsha Hulageri 2010-07-11 03:49:52

related questions

What language do you use for Postgresql triggers and stored procedures?

Are Multiple DataContext classes ever appropriate?

Which tools do people use to create Data Dictionaries?

Any experiences with Protocol Buffers?

Mechanisms for tracking DB schema changes

How big can a MySQL database get before performance starts to degrade.

How do I index a database field

How does database indexing work?

How do I connect to a database and loop over a recordset in C#?

Editing database records by multiple users

Object Oriented vs Relational Databases

VFP .NET OLEdb provider does not work in Win 64-Bits. Help

Embedded Database for .net that can run off a network

Connect PHP to an AS/400

Swap unique indexed column values in database.

cx_Oracle - what is the best way to iterate over a result set?

cx_Oracle - How do I access Oracle from Python?

.NET Migrations Engine

Is there a version control system for database structure changes?

SQLite and XSD

How do I version my MS SQL database in SVN?

XSD DataSets and ignoring foreign keys

Flat File Databases in PHP

Throw Error In MySQL Trigger

Binary Data in MySQL