views: 211
answers: 4

I am going to start on a new project in which I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about the project, but here is an overview:

  1. Lots of writes and lots of reads on the same tables, very real-time
  2. Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
  3. We foresee lots and lots of usage in terms of aggregate queries
  4. Each row of data may contain lots of attributes to deal with

I am suggesting the following as a solution:

  1. Use a distributed hash table sort of persistence (not S3, but an in-house one); a rough sketch follows this list
  2. Use something like Hadoop/Hive (is there a .NET replacement?) for any analytical processing across the nodes
  3. Implement the GUI in ASP.NET/Silverlight (with lots of Ajax, wherever required)
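
To make point 1 a little more concrete, here is a rough sketch of the kind of key-to-node routing I have in mind for the in-house store, using a consistent-hash ring. All of the names and numbers here are placeholders, not a finished design:

```
// Rough sketch only: a consistent-hash ring for routing keys to storage nodes.
// Node names, the 100 virtual points per node, and the MD5-based hash are all
// placeholder choices for illustration.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public class HashRing
{
    // Ring position -> physical node name, kept sorted by position.
    private readonly SortedDictionary<uint, string> _ring = new SortedDictionary<uint, string>();

    public void AddNode(string node)
    {
        // Spread each physical node across 100 points on the ring for smoother balancing.
        for (int i = 0; i < 100; i++)
        {
            _ring[Hash(node + "#" + i)] = node;
        }
    }

    public string GetNodeForKey(string key)
    {
        uint position = Hash(key);

        // The key belongs to the first ring point at or after its hash,
        // wrapping around to the lowest point if none is found.
        foreach (uint point in _ring.Keys)
        {
            if (point >= position)
            {
                return _ring[point];
            }
        }
        return _ring[_ring.Keys.First()];
    }

    private static uint Hash(string value)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(value));
            return BitConverter.ToUInt32(digest, 0);
        }
    }
}
```

A read or write would then be routed with ring.GetNodeForKey("customer:42") before talking to that node; what I have not worked out yet is rebalancing when nodes are added or removed.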

What do you guys think? Am I making any sense here?

+2  A: 

Are your goals performance, maintainability, improving the odds of success, or being cutting edge?

Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (Red Gate's is good), you can simulate that kind of workload quite easily.

If you try to simulate that workload on a non-relational or cloud database, you may end up writing your own tooling.
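
For example, a minimal write-load simulator against SQL Server could look like the sketch below; the connection string, the dbo.Events table, and its columns are placeholders you would swap for your own schema:

```
// Minimal sketch of a write-load simulator for SQL Server. The connection string,
// the dbo.Events table, and its three columns are placeholders for your own schema.
using System;
using System.Data;
using System.Data.SqlClient;

class LoadSimulator
{
    static void Main()
    {
        const string connectionString = "Server=.;Database=LoadTest;Integrated Security=true;";
        Random random = new Random();

        // Build a batch of fake rows in memory.
        DataTable batch = new DataTable();
        batch.Columns.Add("Id", typeof(int));
        batch.Columns.Add("CreatedAt", typeof(DateTime));
        batch.Columns.Add("Payload", typeof(string));

        for (int i = 0; i < 100000; i++)
        {
            batch.Rows.Add(i, DateTime.UtcNow, new string('x', random.Next(50, 400)));
        }

        // Bulk-insert the batch; loop this, and run several copies in parallel,
        // to approximate sustained write load while your read queries execute.
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Events";
            bulkCopy.WriteToServer(batch);
        }
    }
}
```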

MatthewMartin
My intention is to see how well a non-relational database and distributed query processing would work here. I am not sure how a relational database, or an architecture like that, could cope with this scenario.
Vadi
Then I recommend putting together a test harness with realistic data. Anything else is just wild speculation and premature optimization. I also recommend tracking down big companies that have similar workloads and seeing if there is any public info on their architecture, e.g. Digg.com runs on MySQL, and Google runs on a distributed DB like you mention. But Google is probably processing TBs of data per second. We can only hope that your project will be that successful.
MatthewMartin
I am with you ...
Vadi
+2  A: 

"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"

This is the hallmark of a data warehouse.

Here's the trick with DW processing.

  1. Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.

  2. To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly, so that every query has this form, performance can be very, very good. (A sketch of this shape follows the list.)

  3. Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
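
To illustrate point 2, every report boils down to a query of roughly this shape when run from .NET; the star-schema names (dbo.FactSales, dbo.DimDate) and the connection string are invented for the sketch, not a recommended schema:

```
// Illustration of the "one query shape" idea, run from .NET. The star-schema
// names and connection string are invented for this sketch.
using System;
using System.Data.SqlClient;

class DatamartQuery
{
    static void Main()
    {
        const string connectionString = "Server=.;Database=Datamart;Integrated Security=true;";
        const string sql = @"
            SELECT d.CalendarMonth, SUM(f.Amount) AS TotalAmount
            FROM dbo.FactSales AS f
            JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
            GROUP BY d.CalendarMonth";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                // Every report in the datamart follows this SELECT/JOIN/GROUP BY pattern;
                // only the measures and the dimension attributes change.
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader["CalendarMonth"], reader["TotalAmount"]);
                }
            }
        }
    }
}
```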

Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.

Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.

You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
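
As a rough sketch of that key pre-assignment (the file names and the leading customer-code field are assumptions made for the example), a single pass over the flat file with an in-memory dictionary is usually all it takes:

```
// Rough sketch of pre-assigning surrogate keys in a single pass over a flat file,
// before the bulk load. File names and the leading "customer code" field are
// assumptions made for the example.
using System.Collections.Generic;
using System.IO;

class KeyAssigner
{
    static void Main()
    {
        Dictionary<string, int> customerKeys = new Dictionary<string, int>();
        int nextKey = 1;

        using (StreamReader reader = new StreamReader("raw_facts.csv"))
        using (StreamWriter writer = new StreamWriter("facts_with_keys.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split(',');
                string customerCode = fields[0]; // natural key from the source system

                // Assign (or reuse) the surrogate key entirely in memory, with no database round trip.
                int surrogateKey;
                if (!customerKeys.TryGetValue(customerCode, out surrogateKey))
                {
                    surrogateKey = nextKey++;
                    customerKeys[customerCode] = surrogateKey;
                }

                writer.WriteLine(surrogateKey + "," + line);
            }
        }
    }
}
```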

Get Ralph Kimball's Data Warehouse Toolkit books.

S.Lott
A: 

"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.

Scaling can be tricky, but that doesn't mean you have to go with cloud computing. Replication in the DBMS will usually do the trick, along with web application clusters, load balancers, etc.

Chochos
+1  A: 

Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
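
A caching layer usually means the cache-aside pattern; the sketch below uses a hypothetical ICache interface standing in for whichever memcached client you pick, so none of these names come from a real library:

```
// Sketch of the cache-aside pattern. ICache is a hypothetical abstraction standing
// in for whichever memcached client you choose; it is not a real library API.
using System;

public interface ICache
{
    bool TryGet(string key, out string value);
    void Set(string key, string value, TimeSpan timeToLive);
}

public class ProductRepository
{
    private readonly ICache _cache;

    public ProductRepository(ICache cache)
    {
        _cache = cache;
    }

    public string GetProductName(int productId)
    {
        string cacheKey = "product-name:" + productId;

        // Serve from the cache when possible; fall back to the database on a miss
        // and repopulate the cache so the next read is cheap.
        string name;
        if (_cache.TryGet(cacheKey, out name))
        {
            return name;
        }

        name = LoadNameFromDatabase(productId);
        _cache.Set(cacheKey, name, TimeSpan.FromMinutes(5));
        return name;
    }

    private string LoadNameFromDatabase(int productId)
    {
        // Placeholder for the real database query.
        return "product " + productId;
    }
}
```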

SquareCog