tags:
views: 1102
answers: 14

My boss is asking me and the others at the office which database engine would be best for a very, very huge amount of data. We have been discussing it for a while without reaching any conclusion, but I think this is an interesting question to put up for discussion here.

+2  A: 

Hadoop

Javier
Hadoop isn't a database
skaffman
+2  A: 

Properly configured, Oracle or Microsoft's SQL Server should be able to handle anything you throw at them.

AR
A: 

See also http://stackoverflow.com/questions/205324/which-db-should-i-select-if-performance-of-postgres-is-low

Yahoo uses PostgreSQL - http://postgresqldbnews.blogspot.com/2008/05/world-largest-database-runs-on-postgres.html

Google uses MySQL - http://www.mysql.com/customers/customer.php?id=75

I've seen Oracle databases in the hundreds-of-gigabytes to terabyte range

Likewise with DB2

I haven't personally seen SQL Server used in the terabyte range, but it may very well be capable of it

warren
Google does not use MySQL as-is; there are a few blogs that describe exactly what Google did. They took MySQL and built new features on top of it that you will not find in a stock download of MySQL. This is why MySQL is working together with Google on new versions.
balexandre
Lots of terabyte SQL Server databases out there.
HLGEM
Similar to what balexandre said, the article warren linked to points out that Yahoo heavily customized the backend to optimize it for their use case. These additions are not in standard PostgreSQL.
R. Bemrose
Adwords runs on mysql, but the google web index runs on BigTable/GFS.
Joeri Sebrechts
the additions made to both MySQL and PostgreSQL are being pushed back into the codebases, either directly from Google/Yahoo or at least as general concepts (if not specific code)
warren
A: 

Depends on what you mean by huge. Indexes are going to be essential, and queries will need to be optimized as much as possible. I've worked with what I consider huge databases in both Microsoft SQL Server (11 million event records logged daily at a large global automotive company) and in Postgres (several million records as well, for GIS data).
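A minimal sketch of the indexing point above, using SQLite purely as a stand-in (table and column names are made up for illustration): the query planner falls back to a full table scan until an index exists, which is the same principle that bites at SQL Server or Postgres scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events (source, payload) VALUES (?, ?)",
    [(f"sensor-{i % 100}", "data") for i in range(10_000)],
)

# Without an index, filtering on `source` is a full table scan.
scan_detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE source = 'sensor-7'"
).fetchall()[0][-1]

conn.execute("CREATE INDEX idx_events_source ON events (source)")

# With the index in place, the planner switches to an index search.
index_detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE source = 'sensor-7'"
).fetchall()[0][-1]

print(scan_detail)   # a SCAN over the table
print(index_detail)  # a SEARCH using idx_events_source
```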

stephenbayer
+6  A: 

Some of the largest databases in the world are run on Microsoft's SQL server: http://www.microsoft.com/sql/prodinfo/compare/wintercorp.mspx

And here's a listing of the 10 largest databases in the world (although I have no idea how accurate it is):

http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html

The top entry is the World Data Centre for Climate, with 220 terabytes of web data and 6 petabytes of additional data.

amdfan
I wish! But no, I don't.
amdfan
+7  A: 

To get a meaningful answer it might help to know what you consider a very very huge amount of data. It would also help to know what it would be used for and how it would be organised.

If it's gigabytes of data most databases would be fine.

If it's terabytes of data, I'm pretty sure Oracle or SQL Server won't have a problem with it, provided they're configured and designed correctly.

If it's bigger... something like BigTable.

Bravax
It is also worth noting that MySQL is capable of handling data sizes in a similar range to Oracle or SQL Server. More info can be found at Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
cdeszaq
+3  A: 

Probably more important than knowing how big "huge" is, is knowing what you want to do with the database. If it is an OLTP system, you will be fine with any of the main players, as answered before; but if it is for a reporting tool, you may want to investigate Alterian and Sybase IQ instead.

CodeForNothing
+3  A: 

Really depends on what your idea of huge is, and what you want to do with it. For SQL-like access:

  • Gigabytes of data can easily be handled by any FOSS or commercial product.
  • Hundreds of gigabytes+ usually means something like Teradata

For more specialized processing, Hadoop and HBase are appropriate. (Several similar products exist, including Sector/Sphere and GridGain, to name a couple.) Hadoop is a cloud-computing architecture modeled on Google's filesystem, and can hold hundreds of petabytes. HBase is a "database" which runs on Hadoop, with similar capabilities. I say "database" because it is column-oriented, a very different model from row-oriented databases like MySQL, PostgreSQL, Oracle, etc.

Hadoop/HBase are more suited either to data warehousing or to situations where you can precompute the queries you'll need to run and have them executed out-of-band via MapReduce.
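To make the MapReduce idea concrete, here is a hypothetical sketch of the pattern run locally in plain Python rather than on a Hadoop cluster (the record format and counting task are invented for illustration): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, 1) pairs -- here, counting events per user.
    user, _action = record.split(",")
    yield (user, 1)

def reduce_phase(key, values):
    return (key, sum(values))

records = ["alice,login", "bob,login", "alice,click", "alice,logout"]

# Shuffle: group mapped pairs by key, as the framework would between phases.
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(r) for r in records):
    grouped[key].append(value)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # {'alice': 3, 'bob': 1}
```

On a real cluster the map and reduce phases run in parallel across machines, which is what lets the precomputed queries scale to the data sizes discussed above.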

jonathan-stafford
A: 

Remember, a large database is about much more than just choosing the engine. All queries will need to be optimized, indexing is critical, and you must have hardware capable of handling the data. You may need to design a partitioning scheme to properly separate the data and ensure better access. Poor techniques and designs that you can get away with in smaller databases cause huge problems in large ones. Do not do this without investing in a good performance-tuning book! Understand performance tuning before you design the database.
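As a rough sketch of what a partitioning scheme means (the key names and partition count here are invented for illustration, and real engines do this declaratively): rows are routed to one of N partitions by a stable hash of their key, so any single lookup only has to touch one partition.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # A stable hash (not Python's per-process randomized hash()) so that
    # routing decisions survive restarts and are the same on every machine.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

partitions = {i: [] for i in range(NUM_PARTITIONS)}
for customer_id in ("c-1001", "c-1002", "c-1003", "c-1004"):
    partitions[partition_for(customer_id)].append(customer_id)

# All reads and writes for a given key now hit exactly one partition.
print(partition_for("c-1001") == partition_for("c-1001"))  # True
```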

Whatever database you choose, make sure you get the full featured Enterprise version of it. You will probably need features that these versions have that the less featured and cheaper versions do not.

Oracle, SQL Server, DB2, etc. will all handle large enterprise databases. Whether they handle them well will depend on your hardware, your design and your queries. This is not a design task for the inexperienced.

HLGEM
A: 

This posting I made a few weeks ago discusses the relative merits of Oracle and SQL Server for VLDB applications with a segue into shared nothing systems (of which the best known example is Teradata) that give you the best degree of scalability through having no central I/O bottleneck.

It might be quite a good start.

ConcernedOfTunbridgeWells
A: 

In addition to the questions of size asked by others above, the structure and access patterns of your application influence the answer. If your data is updated infrequently and you can manage the relationships without an RDBMS, then Berkeley DB may be a good choice. Even if these criteria don't apply, BDB may be an option as a read-only cache.
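A hedged sketch of the read-only-cache idea: Python's stdlib `dbm` module stands in here for Berkeley DB (the keys and values are made up), but the model is the same simple get/put key-value store, built once and then served read-only with no RDBMS in the loop.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache")

# Build phase: write the data once (e.g. from a nightly batch job).
with dbm.open(path, "c") as db:
    db[b"user:42"] = b"Alice"
    db[b"user:43"] = b"Bob"

# Serve phase: reopen read-only; simple lookups need no relational engine.
with dbm.open(path, "r") as db:
    alice = db[b"user:42"].decode()

print(alice)  # Alice
```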

Pat Leahy
A: 

We looked at Sybase IQ (http://en.wikipedia.org/wiki/Sybase_IQ), but we decided our data warehouse isn't large enough to warrant it.

AJ
+12  A: 

As a DBA, here's the real answer.

The best database platform in the world can be unable to handle the tiniest of loads if you don't have staff that knows how to use it.

A fairly low-end database platform can scale up to huge loads if you have staff that knows it inside and out and works around its weaknesses.

This is why some people say Oracle can't scale, and why other people say MySQL can rule the world. A good musician can still make some beautiful sounds with a crappy instrument, and crappy musicians can't make the best instruments sound good.

When you're deciding what platform to use for your next project, don't go throwing dice and using a solution you have no experience with. If you really need to scale up, either use the solution you know inside and out, or hire somebody who knows the new solution inside and out.

Brent Ozar
+1 good words, so true.
JavaRocky
A: 

Sybase IQ is an excellent choice as long as you are doing more reading than writing. IQ isn't particularly quick for writes but is blisteringly quick for reads. IQ houses the world's biggest data warehouse (1 petabyte).

In comparison to Oracle, DB2 and SQL Server, the licence fees are also fairly low and the hardware requirements are lower too. The downside is that support, documentation and tools are a bit thin on the ground (as are experienced IQ developers and DBAs).

Steve