views: 291

answers: 5

We have a huge dataset of about 300 million records, which will be updated every 3-6 months. We need to query this data (continuously, in real time) to get some information. What are the options: an RDBMS (MySQL), or something else like Hadoop? Which would be better?

+1  A: 

Well, I have a few PostgreSQL databases with tables of more than 700M records, and they are updated all the time.

Queries on those tables run very fast (a few milliseconds) and without any problems. Now, my data is pretty simple, and I have indexes on the fields I query.

So, I'd say it all depends on what kind of queries you'll be making, and whether you have enough money to spend on fast disks.

mat
Those are some incredible numbers. I've never experienced anything like that with Oracle or MS SQL. Do you think it's really the machines and disks that do the bulk of the work?
Kieveli
Well, that particular table has 782M records; it takes 62GB on disk, and the main index takes 17GB. On that box there is 3.5GB of memory, so each time I need a row it has to read the index from disk, since the index won't fit in memory. So yes, fast disks matter.
mat
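A minimal PostgreSQL sketch of the pattern mat describes; the events table and user_id column are made up for illustration:

    -- Hypothetical table standing in for mat's 700M-row table.
    CREATE TABLE events (
        id      bigint NOT NULL,
        user_id bigint NOT NULL,
        payload text
    );

    -- Index on the field that actually gets queried.
    CREATE INDEX idx_events_user_id ON events (user_id);

    -- EXPLAIN ANALYZE shows whether the planner uses the index
    -- (an "Index Scan" node) or falls back to a sequential scan.
    EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345;

With an index like this, a point lookup touches only a handful of index and heap pages rather than the whole 62GB table, which is why it can stay in the millisecond range even when the index itself doesn't fit in memory.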
A: 

300 million records should pose no problem to a top-end RDBMS like Oracle, SQL Server, or DB2. I'm not sure about MySQL, but I'm pretty sure it gets used for some pretty big databases these days.

Tony Andrews
+1  A: 

As others said, modern RDBMSs can handle such tables, depending on the queries and schema (some optimizations would have to be made). If you have a good key to split the rows on (such as a date column), then partitioning/sharding techniques will help you split the table into several smaller ones.

You can read more on those and other scaling techniques in a question I asked some time ago here - http://stackoverflow.com/questions/189903/scaling-solutions-for-mysql-replication-clustering

Eran Galperin
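As a hedged illustration of the date-based partitioning idea (MySQL 5.1+ range-partitioning syntax; the table and columns are hypothetical):

    -- Rows are stored in separate physical partitions by year, so
    -- queries that filter on `created` only touch the relevant ones.
    CREATE TABLE records (
        id      INT NOT NULL,
        created DATE NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id, created)  -- partition column must appear in every unique key
    )
    PARTITION BY RANGE (YEAR(created)) (
        PARTITION p2007 VALUES LESS THAN (2008),
        PARTITION p2008 VALUES LESS THAN (2009),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

Sharding (splitting rows across separate servers) follows the same idea but has to be managed at the application layer; the linked question covers both approaches.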
A: 

300 million does not really count as huge these days :-).

If you are mostly querying, and you know more or less what form the queries will take, then MySQL tables with the appropriate indexes will work just fine (a quick way to verify this is sketched below).

If you are constantly applying updates at the same time as you are running queries, then choose PostgreSQL, as it has better concurrency handling.

MS SQL Server, Sybase, Oracle, and DB2 will all handle these volumes with ease, if your company prefers to spend the money.

If, on the other hand, you intend to run truly free-form queries on unstructured data, then Hadoop or something similar would be a better bet.

James Anderson
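To make "appropriate indexes" concrete, a small MySQL sketch (reusing the hypothetical records table from the partitioning example above) for checking that a query actually uses an index before betting 300M rows on it:

    -- Index matching the known query shape.
    CREATE INDEX idx_records_created ON records (created);

    -- EXPLAIN reports how MySQL will execute the query:
    -- type=range and key=idx_records_created mean the index is used;
    -- type=ALL would mean a full table scan.
    EXPLAIN SELECT COUNT(*)
    FROM records
    WHERE created BETWEEN '2009-01-01' AND '2009-01-31';

Since the question says the query shapes are known in advance, running EXPLAIN on each of them is a cheap way to confirm the indexes are right before going live.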
+2  A: 

300M records is well within the bounds of regular relational databases, and live querying should be no problem if you use indexes properly.

Hadoop sounds like overkill unless you really need highly distributed and redundant data, and it will also make it harder to find support if you run into trouble or need help with optimization.

Tiberiu Ana