I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).

The only requirements are fast inserts, fast lookups for all records with the same GUID, and the ability to access the data store from .NET.

I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other NoSQL solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best, due to optimizations for distributed queries and scaling. I tried Cassandra, but the .NET libraries either don't currently compile or are all subject to change (along with Cassandra itself).

I've looked into many of the NoSQL data stores available, but I can't find one that meets my needs as a robust, production-ready platform.

If you had to store 36 billion small, flat records so that they're accessible from .NET, what would you choose and why?

A: 

Store the records in plain binary files, one file per GUID; it wouldn't get any faster than that.

thomask
Do you really expect this to perform well?
ChaosPandion
Yes, creating billions of files on a file system can be devastating to some file systems. I made the mistake of doing something like this with only 1 million, and I pretty much took the system down trying to open a shell in one of those folders. Also, unless you are looking up based on a GUID, how is the query mechanism supposed to work?
Rob Goodwin
It is hard to guess how this would perform without knowing how many unique GUIDs are expected :) But it doesn't get any simpler than just writing to plain files, and fast inserts along with lookup by GUID were the only requirements.
thomask
It can work, but you have to limit the number of files per folder: generate a new folder per n files, for example by using a substring of the GUID as the folder name (see the sketch after these comments).
TTT
+1. Because I think it can work.
tuinstoel
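
For illustration, a minimal C# sketch of the file-per-GUID idea with the folder sharding suggested in the comments (the root path, the two-character prefix scheme, and the length-prefixed record layout are assumptions made for the sketch, not part of the original answer):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Append-only store: one binary file per GUID, sharded into subfolders by the
    // first two hex characters of the GUID so no single folder holds too many files.
    public class GuidFileStore
    {
        private readonly string _root;

        public GuidFileStore(string root) => _root = root;

        // Fast insert: append one length-prefixed record to the GUID's file.
        public void Append(Guid id, byte[] record)
        {
            string name = id.ToString("N");                            // 32 hex chars, no dashes
            string folder = Path.Combine(_root, name.Substring(0, 2)); // shard by GUID prefix
            Directory.CreateDirectory(folder);                         // no-op if it already exists
            using var stream = new FileStream(Path.Combine(folder, name + ".bin"),
                                              FileMode.Append, FileAccess.Write);
            using var writer = new BinaryWriter(stream);
            writer.Write(record.Length);
            writer.Write(record);
        }

        // Fast lookup: stream back every record stored under this GUID.
        public IEnumerable<byte[]> ReadAll(Guid id)
        {
            string name = id.ToString("N");
            string path = Path.Combine(_root, name.Substring(0, 2), name + ".bin");
            if (!File.Exists(path)) yield break;
            using var stream = File.OpenRead(path);
            using var reader = new BinaryReader(stream);
            while (stream.Position < stream.Length)
            {
                int length = reader.ReadInt32();
                yield return reader.ReadBytes(length);
            }
        }
    }

Whether this holds up at billions of GUIDs is exactly the concern raised above: the per-record work is tiny, but file-system metadata (directory entries, enumeration, backup) becomes the real cost.
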
A: 

If you had to store 36 billion small, flat records so that they're accessible from .NET, what would you choose and why?

I would check Berkeley DB. According to the datasheet:

Why: it scales to 256 terabytes of data in a single table and provides fast indexed and sequential access. It's free and mature.

stacker
+1  A: 

Very few people work at multi-billion-row set sizes, and most times I see a request like this on Stack Overflow, the data is nowhere near the size it is being reported as.

36 billion total, at 3 billion per month, is roughly 100 million per day, 4.16 million an hour, ~70k rows per minute, and ~1.2k rows per second coming into the system, sustained for 12 months, assuming no downtime.

Those figures are not impossible by a long margin - I've done larger systems - but you want to double-check that those are really the quantities you mean; very few apps really have this volume.

In terms of storing/retrieving, quite a critical aspect you have not mentioned is aging out the older data - deletion is not free.

The normal technology to look at is partitioning; however, with the lookup/retrieval being GUID-based, you would get poor performance, assuming you have to get every matching value across the whole 12-month period. You could place a clustered index on the GUID column to get your associated data clustered for read/write, but at those quantities and that insertion speed, the fragmentation will be far too high to support, and it will fall on the floor.

I would also suggest that you are going to need a very decent hardware budget if this is a serious application with OLTP-type response speeds; by some approximate guesses, assuming very little indexing overhead, that is about 2.7 TB of data.

In the SQL Server camp, the only thing that you might want to look at is the new parallel data warehouse edition (Madison), which is designed more for sharding out data and running parallel queries against it to provide high speed against large data marts.

Andrew
In bioinformatics billion-row datasets are not uncommon. But they are frequently dealt with in a purely streaming fashion from flat files.
Erik Garrison
@Erik: for stream processing (i.e. you just need to detect certain conditions, and there is no need to store the data for later querying), something like StreamInsight is better than any database: http://www.microsoft.com/sqlserver/2008/en/us/r2-complex-event.aspx
Remus Rusanu
+2  A: 

Contrary to popular belief, NoSQL is not about performance, or even scalability. It's mainly about minimizing the so-called object-relational impedance mismatch, but it is also about horizontal scalability vs. the more typical vertical scalability of an RDBMS.

For the simple requirement of fast inserts and fast lookups, almost any database product will do. If you want to add relational data, or joins, or have any complex transactional logic or constraints you need to enforce, then you want a relational database. No NoSQL product can compare.

If you need schemaless data, you'd want to go with a document-oriented database such as MongoDB or CouchDB. The loose schema is the main draw of these; I personally like MongoDB and use it in a few custom reporting systems. I find it very useful when the data requirements are constantly changing.
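
As a small illustration of that loose schema from .NET (the database, collection and field names here are invented, and the calls are from the current 2.x C# driver rather than anything mentioned in the answer):

    using System;
    using MongoDB.Bson;
    using MongoDB.Driver;

    // Hypothetical "records" collection: no schema to declare, fields can vary per document.
    var client = new MongoClient("mongodb://localhost:27017");
    var records = client.GetDatabase("tracking").GetCollection<BsonDocument>("records");

    var guid = Guid.NewGuid().ToString();

    // Insert a small, flat record keyed by GUID.
    records.InsertOne(new BsonDocument
    {
        { "guid", guid },
        { "createdAt", DateTime.UtcNow },
        { "payload", new BsonBinaryData(new byte[] { 1, 2, 3 }) }
    });

    // Later: fetch every record that shares the same GUID.
    var matches = records.Find(new BsonDocument("guid", guid)).ToList();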

The other main NoSQL option is distributed key-value stores such as BigTable or Cassandra. These are especially useful if you want to scale your database across many machines running commodity hardware. They work fine on servers too, obviously, but don't take advantage of high-end hardware as well as SQL Server or Oracle or other databases designed for vertical scaling, and obviously, they aren't relational and are no good for enforcing normalization or constraints. Also, as you've noticed, .NET support tends to be spotty at best.

All relational database products support partitioning of a limited sort. They are not as flexible as BigTable or other DKVS systems, and they don't partition easily across hundreds of servers, but it really doesn't sound like that's what you're looking for. They are quite good at handling record counts in the billions, as long as you index and normalize the data properly, run the database on powerful hardware (especially SSDs if you can afford them), and partition across 2 or 3 or 5 physical disks if necessary.

If you meet the above criteria - you're working in a corporate environment and have money to spend on decent hardware and database optimization - I'd stick with SQL Server for now. If you're pinching pennies and need to run this on low-end Amazon EC2 cloud-computing hardware, you'd probably want to opt for Cassandra or Voldemort instead (assuming you can get either to work with .NET).

Aaronaught
+13  A: 

Storing ~3.5 TB of data and inserting about 1K rows/sec 24x7, while also querying at an unspecified rate, is possible with SQL Server, but there are more questions:

  • what availability requirement do you have for this? 99.999% uptime, or is 95% enough?
  • what reliability requirement do you have? Does missing an insert cost you $1M?
  • what recoverability requirement do you have? If you lose one day of data, does it matter?
  • what consistency requirement do you have? Does a write need to be guaranteed to be visible on the next read?

If you need all the requirements I highlighted, the load you propose is going to cost millions in hardware and licensing on a relational system - any system - no matter what gimmicks you try (sharding, partitioning, etc.). A NoSQL system would, by its very definition, not meet all these requirements.

So obviously you have already relaxed some of these requirements. There is a nice visual guide comparing the NoSQL offerings based on the 'pick 2 out of 3' paradigm at Visual Guide to NoSQL Systems:

[image: NoSQL systems comparison chart]

After OP comment update

With SQL Server, this would be a straightforward implementation:

  • one single table with a clustered (GUID, time) key. Yes, it is going to get fragmented, but fragmentation affects read-aheads, and read-aheads are needed only for significant range scans. Since you only query for a specific GUID and date range, fragmentation won't matter much. Yes, it is a wide key, so non-leaf pages will have poor key density. Yes, it will lead to a poor fill factor. And yes, page splits may occur. Despite these problems, given the requirements, it is still the best clustered key choice (a DDL sketch follows this list).
  • partition the table by time so you can implement efficient deletion of the expired records, via an automatic sliding window. Augment this with an online index partition rebuild of the last month to eliminate the poor fill factor and fragmentation introduced by the GUID clustering.
  • enable page compression. Since the clustered key groups by GUID first, all records of a GUID will be next to each other, giving page compression a good chance to deploy dictionary compression.
  • you'll need a fast IO path for the log file. You're interested in high throughput, not low latency, for the log to keep up with 1K inserts/sec, so striping is a must.
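
As a rough sketch of what the first three bullets could look like in DDL, executed from .NET (table, column and partition names are invented, the boundary values are placeholders, and nothing here is tuned; it only illustrates the clustered (GUID, time) key, monthly partitioning and page compression described above):

    using Microsoft.Data.SqlClient;

    // Illustrative schema for the design above: a clustered (GUID, time) key,
    // monthly partitions for the sliding window, and page compression.
    const string ddl = @"
    CREATE PARTITION FUNCTION pfByMonth (datetime2)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01'); -- one boundary per month

    CREATE PARTITION SCHEME psByMonth
        AS PARTITION pfByMonth ALL TO ([PRIMARY]);   -- map to real filegroups in production

    CREATE TABLE dbo.Records
    (
        RecordGuid  uniqueidentifier NOT NULL,
        EventTime   datetime2        NOT NULL,
        Payload     varbinary(75)    NOT NULL        -- the 50-75 byte payload
    ) ON psByMonth (EventTime);

    -- Clustered on (GUID, time) so all rows for a GUID sit together; page compression
    -- benefits from the repeated GUID values on each page.
    CREATE CLUSTERED INDEX CIX_Records
        ON dbo.Records (RecordGuid, EventTime)
        WITH (DATA_COMPRESSION = PAGE)
        ON psByMonth (EventTime);
    ";

    using var connection = new SqlConnection("Server=.;Database=Tracking;Integrated Security=true");
    connection.Open();
    using var command = new SqlCommand(ddl, connection);
    command.ExecuteNonQuery();

The sliding-window deletion in the second bullet then becomes a metadata-only operation: SWITCH the oldest partition out to a staging table, drop it, and MERGE/SPLIT the partition function boundaries each month.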

Partitioning and page compression each require Enterprise Edition SQL Server; they will not work on Standard Edition, and both are quite important to meet the requirements.

As a side note, if the records come from a farm of front-end web servers, I would put Express on each web server and, instead of INSERTing on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express instance co-located with the web server. This gives a much, much better availability story to the solution.

So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and the solutions are known. That doesn't necessarily mean this is better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowledgeable in things NoSQL-ish argue their case.

Note that I never mentioned the programming model, .NET support and such. I honestly think they're irrelevant in large deployments. They make a huge difference in the development process, but once deployed, it doesn't matter how fast the development was if the ORM overhead kills performance :)

Remus Rusanu
+1 - For having the cojones to throw in a nice big colorful chart. :)
ChaosPandion
I hotlinked Nathan's site, but this is not the Slashdot front page ;)
Remus Rusanu
A: 

There is an unusual fact that seems to be overlooked.

"Basically after inserting 30Mil rows in a day, I need to fetch all the rows with the same GUID (maybe 20 rows) and be reasonably sure I'd get them all back"

Needing to fetch only ~20 rows per GUID, a non-clustered index on the GUID will work just fine. You could cluster on another column for data dispersion across partitions.
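
A brief sketch of that arrangement (invented table and column names; the clustering column here is an identity purely for illustration, since the answer leaves the choice open):

    using Microsoft.Data.SqlClient;

    // Cluster on a narrow, ever-increasing key and put a non-clustered index on the
    // GUID for the "fetch ~20 rows per GUID" lookups.
    const string ddl = @"
    CREATE TABLE dbo.Records
    (
        RecordId    bigint IDENTITY(1,1) NOT NULL,
        RecordGuid  uniqueidentifier     NOT NULL,
        Payload     varbinary(75)        NOT NULL,
        CONSTRAINT PK_Records PRIMARY KEY CLUSTERED (RecordId)
    );

    CREATE NONCLUSTERED INDEX IX_Records_Guid
        ON dbo.Records (RecordGuid)
        INCLUDE (Payload);   -- covering, so GUID lookups never touch the base table
    ";

    using var connection = new SqlConnection("Server=.;Database=Tracking;Integrated Security=true");
    connection.Open();
    using var command = new SqlCommand(ddl, connection);
    command.ExecuteNonQuery();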

I have a question regarding the data insertion: How is it being inserted?

  • Is this a bulk insert on a certain schedule (per min, per hour, etc)?
  • What source is this data being pulled from (flat files, OLTP, etc)?

I think these need to be answered to help understand one side of the equation.

Josef Richberg
A: 

You can use MongoDB and use the GUID as the sharding key; this means that you can distribute your data over multiple machines, but the data you want to select is only on one machine, because you select by the sharding key.

Sharding in MongoDB is not yet production-ready, however.
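
For reference, a sketch of what enabling that setup looks like from .NET (the database and collection names are invented, and the 2.x C# driver plus the standard enableSharding/shardCollection admin commands are assumptions, not something from the original answer):

    using MongoDB.Bson;
    using MongoDB.Driver;

    // Hypothetical setup: shard a "tracking.records" collection on its guid field so all
    // records for one GUID land on (and are read from) a single shard.
    var client = new MongoClient("mongodb://my-mongos-router:27017");  // connect via mongos
    var admin = client.GetDatabase("admin");

    admin.RunCommand(new BsonDocumentCommand<BsonDocument>(
        new BsonDocument("enableSharding", "tracking")));

    admin.RunCommand(new BsonDocumentCommand<BsonDocument>(new BsonDocument
    {
        { "shardCollection", "tracking.records" },
        { "key", new BsonDocument("guid", 1) }
    }));

Ordinary inserts and Find calls then go through the same mongos connection, and the router sends each GUID lookup to the single shard that owns that key's range.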

Theo