views: 211
answers: 4

I am going to start on a new project in which I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about the project, but here is an overview:

  1. Lots of writes and lots of reads on the same tables, very real-time
  2. Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
  3. We foresee lots and lots of usage in terms of aggregate queries
  4. Each row of data may contain lots of attributes to deal with

I am suggesting the following as a solution:

  1. Use a distributed hash table sort of persistence (not S3, but an in-house one); a rough sketch follows this list
  2. Use something like Hadoop/Hive (is there a .NET replacement?) for any analytical processing across the nodes
  3. Implement the GUI in ASP.NET/Silverlight (with lots of Ajax, wherever required)
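
To make point 1 a little more concrete, here is a rough sketch of the kind of key-to-node routing I have in mind for the in-house store, using a consistent-hash ring. All of the names and numbers here are placeholders, not a finished design:

```
// Rough sketch only: a consistent-hash ring for routing keys to storage nodes.
// Node names, the 100 virtual points per node, and the MD5-based hash are all
// placeholder choices for illustration.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public class HashRing
{
    // Ring position -> physical node name, kept sorted by position.
    private readonly SortedDictionary<uint, string> _ring = new SortedDictionary<uint, string>();

    public void AddNode(string node)
    {
        // Spread each physical node across 100 points on the ring for smoother balancing.
        for (int i = 0; i < 100; i++)
        {
            _ring[Hash(node + "#" + i)] = node;
        }
    }

    public string GetNodeForKey(string key)
    {
        uint position = Hash(key);

        // The key belongs to the first ring point at or after its hash,
        // wrapping around to the lowest point if none is found.
        foreach (uint point in _ring.Keys)
        {
            if (point >= position)
            {
                return _ring[point];
            }
        }
        return _ring[_ring.Keys.First()];
    }

    private static uint Hash(string value)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(value));
            return BitConverter.ToUInt32(digest, 0);
        }
    }
}
```

A read or write would then be routed with ring.GetNodeForKey("customer:42") before talking to that node; what I have not worked out yet is rebalancing when nodes are added or removed.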

What do you guys think? Am I making any sense here?

+2  A: 

Are your goals performance, maintainability, improving the odds of success, or being cutting edge?

Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (Red Gate's is good), you can simulate that kind of workload quite easily.

If you try to simulate that workload on a non-relational or cloud database, you may end up writing your own tooling.
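
For example, a minimal write-load simulator against SQL Server could look like the sketch below; the connection string, the dbo.Events table, and its columns are placeholders you would swap for your own schema:

```
// Minimal sketch of a write-load simulator for SQL Server. The connection string,
// the dbo.Events table, and its three columns are placeholders for your own schema.
using System;
using System.Data;
using System.Data.SqlClient;

class LoadSimulator
{
    static void Main()
    {
        const string connectionString = "Server=.;Database=LoadTest;Integrated Security=true;";
        Random random = new Random();

        // Build a batch of fake rows in memory.
        DataTable batch = new DataTable();
        batch.Columns.Add("Id", typeof(int));
        batch.Columns.Add("CreatedAt", typeof(DateTime));
        batch.Columns.Add("Payload", typeof(string));

        for (int i = 0; i < 100000; i++)
        {
            batch.Rows.Add(i, DateTime.UtcNow, new string('x', random.Next(50, 400)));
        }

        // Bulk-insert the batch; loop this, and run several copies in parallel,
        // to approximate sustained write load while your read queries execute.
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Events";
            bulkCopy.WriteToServer(batch);
        }
    }
}
```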

MatthewMartin
My intention is to see how well a non-relational database and distributed query processing would work here. I am not sure how a relational database, or an architecture like that, could cope with this scenario.
Vadi
Then I recommend putting together a test harness with realistic data. Anything else is just wild speculation and premature optimization. I also recommend tracking down big companies that have similar workloads and seeing if there is any public info on their architecture, e.g. Digg.com runs on MySQL, and Google runs on a distributed DB like you mention. But Google is probably processing TBs of data per second. We can only hope that your project will be that successful.
MatthewMartin
I am with you ...
Vadi
+2  A: 

"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"

This is the hallmark of a data warehouse.

Here's the trick with DW processing.

  1. Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.

  2. To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly, so that every query has this form, performance can be very, very good. (A sketch of this shape follows the list.)

  3. Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
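
To illustrate point 2, every report boils down to a query of roughly this shape when run from .NET; the star-schema names (dbo.FactSales, dbo.DimDate) and the connection string are invented for the sketch, not a recommended schema:

```
// Illustration of the "one query shape" idea, run from .NET. The star-schema
// names and connection string are invented for this sketch.
using System;
using System.Data.SqlClient;

class DatamartQuery
{
    static void Main()
    {
        const string connectionString = "Server=.;Database=Datamart;Integrated Security=true;";
        const string sql = @"
            SELECT d.CalendarMonth, SUM(f.Amount) AS TotalAmount
            FROM dbo.FactSales AS f
            JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
            GROUP BY d.CalendarMonth";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                // Every report in the datamart follows this SELECT/JOIN/GROUP BY pattern;
                // only the measures and the dimension attributes change.
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader["CalendarMonth"], reader["TotalAmount"]);
                }
            }
        }
    }
}
```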

Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.

Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.

You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
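
As a rough sketch of that key pre-assignment (the file names and the leading customer-code field are assumptions made for the example), a single pass over the flat file with an in-memory dictionary is usually all it takes:

```
// Rough sketch of pre-assigning surrogate keys in a single pass over a flat file,
// before the bulk load. File names and the leading "customer code" field are
// assumptions made for the example.
using System.Collections.Generic;
using System.IO;

class KeyAssigner
{
    static void Main()
    {
        Dictionary<string, int> customerKeys = new Dictionary<string, int>();
        int nextKey = 1;

        using (StreamReader reader = new StreamReader("raw_facts.csv"))
        using (StreamWriter writer = new StreamWriter("facts_with_keys.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split(',');
                string customerCode = fields[0]; // natural key from the source system

                // Assign (or reuse) the surrogate key entirely in memory, with no database round trip.
                int surrogateKey;
                if (!customerKeys.TryGetValue(customerCode, out surrogateKey))
                {
                    surrogateKey = nextKey++;
                    customerKeys[customerCode] = surrogateKey;
                }

                writer.WriteLine(surrogateKey + "," + line);
            }
        }
    }
}
```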

Get Ralph Kimball's Data Warehouse Toolkit books.

S.Lott
A: 

"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.

Scaling can be tricky, but that doesn't mean you have to go with cloud computing. Replication in the DBMS will usually do the trick, along with web application clusters, load balancers, etc.

Chochos
+1  A: 

Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
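
A caching layer usually means the cache-aside pattern; the sketch below uses a hypothetical ICache interface standing in for whichever memcached client you pick, so none of these names come from a real library:

```
// Sketch of the cache-aside pattern. ICache is a hypothetical abstraction standing
// in for whichever memcached client you choose; it is not a real library API.
using System;

public interface ICache
{
    bool TryGet(string key, out string value);
    void Set(string key, string value, TimeSpan timeToLive);
}

public class ProductRepository
{
    private readonly ICache _cache;

    public ProductRepository(ICache cache)
    {
        _cache = cache;
    }

    public string GetProductName(int productId)
    {
        string cacheKey = "product-name:" + productId;

        // Serve from the cache when possible; fall back to the database on a miss
        // and repopulate the cache so the next read is cheap.
        string name;
        if (_cache.TryGet(cacheKey, out name))
        {
            return name;
        }

        name = LoadNameFromDatabase(productId);
        _cache.Set(cacheKey, name, TimeSpan.FromMinutes(5));
        return name;
    }

    private string LoadNameFromDatabase(int productId)
    {
        // Placeholder for the real database query.
        return "product " + productId;
    }
}
```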

SquareCog