My boss is asking me and the others at the office which database engine would be best for a very, very large amount of data. We have been discussing it for a while and reached no conclusion, but I think this is an interesting question to discuss here.
A properly configured Oracle or Microsoft SQL Server installation should be able to handle anything you throw at it.
See also http://stackoverflow.com/questions/205324/which-db-should-i-select-if-performance-of-postgres-is-low
Yahoo uses PostgreSQL - http://postgresqldbnews.blogspot.com/2008/05/world-largest-database-runs-on-postgres.html
Google uses MySQL - http://www.mysql.com/customers/customer.php?id=75
I've seen Oracle databases in the hundreds of gigs to terabyte range
Likewise with DB2
I haven't personally seen anything on SQL Server into the terabyte range, but it may very well be capable of it
Depends on what you mean by huge. Indexes will need to be in order, and queries will need to be optimized as well as possible. I've worked with what I consider huge databases in both Microsoft SQL Server (11 million records a day logging events at a large global automotive company) and in PostgreSQL (several million records of GIS data).
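To see why indexing matters so much at these volumes, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up for illustration; the same principle applies in SQL Server or PostgreSQL):

```python
import sqlite3

# Hypothetical events table: a toy stand-in for the daily event-logging
# workload described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, site TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events (site, ts) VALUES (?, ?)",
    [(f"site{i % 50}", f"2008-01-{i % 28 + 1:02d}") for i in range(10000)],
)

# Without an index, filtering on `site` forces a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE site = 'site7'"
).fetchall()
print(plan_before[0][-1])  # reports a scan, e.g. "SCAN events"

# With an index, the engine can seek directly to the matching rows.
conn.execute("CREATE INDEX idx_events_site ON events(site)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE site = 'site7'"
).fetchall()
print(plan_after[0][-1])  # now uses idx_events_site
```

On millions of rows the difference between the scan and the index seek is the difference between seconds and milliseconds.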
Some of the largest databases in the world are run on Microsoft's SQL server: http://www.microsoft.com/sql/prodinfo/compare/wintercorp.mspx
And here's a listing of the 10 largest databases in the world (although I have no idea how accurate it is):
http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html
The top entry is the World Data Centre for Climate, with 220 terabytes of web data and 6 petabytes of additional data.
To get a meaningful answer it might help to know what you consider a very very huge amount of data. It would also help to know what it would be used for and how it would be organised.
If it's gigabytes of data most databases would be fine.
If it's terabytes of data, I'm pretty sure Oracle or SQL Server won't have a problem with it, provided they're configured and designed correctly.
If it's bigger... something like Bigtable.
Probably more important than knowing how big "huge" is, is knowing what you want to do with the database. If it's OLTP, then you will be fine with any of the main players mentioned above, but if it's for a reporting tool, you may want to investigate Alterian and Sybase IQ instead.
Really depends on what your idea of huge is, and what you want to do with it. For SQL-like access:
- Gigabytes of data can easily be handled by any FOSS or commercial product.
- Hundreds of gigabytes+ usually means something like Teradata
For more specialized processing, Hadoop and HBase are appropriate. (Several similar products exist, including Sector/Sphere and GridGain, to name a couple.) Hadoop is a cloud-computing architecture modeled on Google's filesystem, and can hold hundreds of petabytes. HBase is a "database" which runs on Hadoop, with similar capabilities. I say "database" because it is column-oriented, a very different model from row-oriented databases like MySQL, PostgreSQL, Oracle, etc.
Hadoop/HBase are more suited either to data warehousing, or situations where you can precompute the queries you'll need to run, and have them executed out-of-band via MapReduce.
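The precompute-then-serve pattern above can be sketched in a few lines. This is a toy, single-process illustration of the MapReduce model, not real Hadoop code; the record format and function names are invented for the example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, 1) for each hit in a hypothetical "host page" log line.
    _host, page = record.split()
    yield (page, 1)

def reduce_phase(key, values):
    # Collapse all counts for one key into a single aggregate.
    return (key, sum(values))

records = ["a /index", "b /index", "c /about"]

# Shuffle step: group all mapped values by key.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_phase(r) for r in records):
    groups[key].append(value)

# The reduced output is what you would store (e.g. in HBase) and
# serve to users, instead of querying the raw records at read time.
precomputed = dict(reduce_phase(k, v) for k, v in groups.items())
print(precomputed)  # {'/index': 2, '/about': 1}
```

In a real deployment the map and reduce phases run in parallel across the cluster, which is what lets this approach scale to data sizes no single relational server can hold.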
Remember that a large database is about much more than just choosing the database. All queries will need to be optimized, indexing is critical, and you must have hardware capable of handling the data. You may need to design a partitioning scheme to properly separate the data and ensure better access. Poor techniques and designs that you can get away with in smaller databases cause huge problems in large ones. Do not do this without investing in a good performance tuning book! Understand performance tuning before you design the database.
Whatever database you choose, make sure you get the full featured Enterprise version of it. You will probably need features that these versions have that the less featured and cheaper versions do not.
Oracle, SQL Server, DB2, etc. will all handle large enterprise databases. Whether they handle them well will depend on your hardware, your design and your queries. This is not a design task for the inexperienced.
This posting I made a few weeks ago discusses the relative merits of Oracle and SQL Server for VLDB applications with a segue into shared nothing systems (of which the best known example is Teradata) that give you the best degree of scalability through having no central I/O bottleneck.
It might be quite a good start.
In addition to the question of size asked by others above, the structure and access patterns of your application influence the answer. If your data is updated infrequently and you can manage the relationships without an RDBMS, then Berkeley DB may be a good choice. Even if those criteria don't apply, BDB may be an option as a read-only cache.
We looked at Sybase IQ (http://en.wikipedia.org/wiki/Sybase_IQ), but we decided our data warehouse isn't large enough to warrant it.
As a DBA, here's the real answer.
The best database platform in the world can be unable to handle the tiniest of loads if you don't have staff that knows how to use it.
A fairly low-end database platform can scale up to huge loads if you have staff that knows it inside and out and works around its weaknesses.
This is why some people say Oracle can't scale, and why other people say MySQL can rule the world. A good musician can still make some beautiful sounds with a crappy instrument, and crappy musicians can't make the best instruments sound good.
When you're deciding what platform to use for your next project, don't go throwing dice and using a solution you have no experience with. If you really need to scale up, either use the solution you know inside and out, or hire somebody who knows the new solution inside and out.
Sybase IQ is an excellent choice as long as you are doing more reading than writing. IQ isn't particularly quick for writes but is blisteringly quick for reads. IQ houses the world's biggest data warehouse (1 petabyte).
In comparison to Oracle, DB2 and SQL Server, the licence fees are fairly low and the hardware requirements are also lower. The downside is that support, documentation and tools are a bit thin on the ground (as are experienced IQ developers and DBAs).