I'm embarking on a project that will likely grow to several million rows in the near future, so I am researching which database to use, since that is sure to become an issue. From what I have read, SQL databases in all their incarnations have issues once a table reaches around 2,000,000 rows. Is there a database recommended for projects of this size?

It is a website I am talking about, and archiving old entries is not ideal, though it can be done if it proves to be an issue I can't overcome.

Thanks.

+3  A: 

I've had tables in MS SQL Server with a fair bit more than 2 million rows without trouble. Of course, it depends on how you're using that data.

Just don't try using MySQL for something like this. At least in my experience, it just doesn't allow enough tweaking to provide high enough performance. I've run into a few cases with large amounts of data in (almost) identically set-up tables, and MySQL 5 performed about 30 times slower than SQL Server on the same hardware. An extreme example, maybe, but still.

I have too little experience with PostgreSQL or Oracle to judge, so I will just stick with not recommending MySQL. Or Access ;)

Thorarin
Hm, while I don't know your particular experience, I find a blanket statement like "it just doesn't allow enough tweaking to provide high enough performance" hard to believe. Consider that Flickr, Facebook and Wikipedia, for example, run on MySQL; they probably have tables with more than 2 million rows.
sleske
To say Facebook uses MySQL is an oversimplification as well. They use a caching system because MySQL itself isn't fast enough. They've made modifications to the source code to improve their replication between different servers and the cache, etc. http://blog.facebook.com/blog.php?post=7899307130
Thorarin
+3  A: 

First off, a million records is not exactly a lot where databases are concerned. Any database worth its salt should be able to handle that just fine.

Create proper indexes on your tables and almost any database will be able to handle that number of records. I've seen MySQL databases with millions of rows that worked just fine, and MySQL is not a heavyweight in database land.
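
As a rough sketch of what "proper indexes" means in practice (the table and column names here are hypothetical, not taken from the question, and the exact types vary a little between engines):

    -- Hypothetical table of website entries with a few million rows.
    CREATE TABLE entries (
        id         INT PRIMARY KEY,          -- the primary key is indexed automatically
        user_id    INT NOT NULL,
        created_at TIMESTAMP NOT NULL,
        title      VARCHAR(200) NOT NULL
    );

    -- Index the columns your WHERE clauses and JOINs actually filter on,
    -- e.g. for "latest entries for a user" queries:
    CREATE INDEX ix_entries_user_created ON entries (user_id, created_at);

    -- With that index in place, a query like this avoids a full table scan
    -- even with millions of rows:
    SELECT id, title
    FROM entries
    WHERE user_id = 42
    ORDER BY created_at DESC;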

MS SQL Server, PostgreSQL, DB2, Progress OpenEdge: almost anything will do if you create proper indexes. Things like MS Access (and possibly SQLite) may fall apart when you put a lot of data in them.

Gerco Dries
+2  A: 

One of the tables in my current project has 13 million rows in it. MS SQL Server handles it just fine. Really, 2 million rows is nothing.

But, seriously, if you want a high-end database, look to Oracle, Teradata, and DB2.

BoltBait
A: 

For most apps MS SQL Server will work fine. MySQL will work for smaller apps, but to answer your question: if you are truly concerned about DB performance, I would go with Oracle if you can afford it. If you are like most of us who can't justify an $80,000 database, I would suggest MS SQL Server; it works well. By the sound of what you are doing (a website), I would use MS SQL Server and utilize caching. Using the database correctly tends to be more important than using the correct database.

RHicke
Oracle pricing starts at well under $80,000. Standard Edition One is $6,000 per processor.
Gary
+3  A: 

No database would call itself an SQL database if it had issues with 2 million records. You can get into trouble with some databases at 2 billion records, though.

I've had MySQL databases with well over 150 million records without trouble. You need to figure out what features you need from a database before deciding, rather than pondering over a few million rows, which is not much at all.

leeeroy
+1  A: 

2,000,000 rows is really not much at all. I've seen plenty of tables with > 50 million rows with acceptable performance, in MS SQL.

IMHO you're still pretty far away from having a 'big database'.

+3  A: 

We run lots of databases with row counts in the hundreds of millions on MS SQL Server (2000, 2005, 2008). Your row count isn't where your problem will arise; it's in the characteristics of access to the data. Depending on how that looks, you may need to scale across separate hardware, and that is where the differences between database servers will really show up (that, and price...)

Mark
A: 

Try looking at other large organizations to see what they're using. MS's proof of concept for very large databases is TerraServer, which runs a database that is several terabytes in size.

Any database will have problems even with a small dataset if you are doing table scans, Cartesian products, expensive calculations for each row, etc.
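
For illustration, here is a sketch of the kinds of query that hurt regardless of engine (the orders/customers tables are made up, and YEAR() stands in for whatever date function your dialect uses):

    -- Accidental Cartesian product: the missing join condition multiplies
    -- the row counts of the two tables before anything is filtered.
    SELECT o.id, c.name
    FROM orders o, customers c;              -- forgot "WHERE o.customer_id = c.id"

    -- Full table scan: wrapping the indexed column in a function
    -- prevents an index on order_date from being used.
    SELECT COUNT(*)
    FROM orders
    WHERE YEAR(order_date) = 2009;

    -- The same filter written as a range can use the index.
    SELECT COUNT(*)
    FROM orders
    WHERE order_date >= '2009-01-01'
      AND order_date <  '2010-01-01';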

To really stress a relational DB with a table of 2 million rows, you'd have to be doing cross tabs while performing a large number of inserts and updates, and at that point you'd want to switch to an OLAP datastore.

Do you have anything else to describe the expected workload? Is it mostly read-only, read-write, etc.?

MatthewMartin
While interesting, TerraServer is a little outdated :) Microsoft has a whitepaper that compares SQL Server to Oracle 11g. Not objective, of course, but they mention some interesting facts, like an example of a company hosting 100 TB of data on SQL Server. http://www.microsoft.com/sqlserver/2008/en/us/sql2008-oracle11g.aspx
Thorarin
+2  A: 

Microsoft SQL Server, MySQL, Oracle, and DB2 can all handle millions and millions of rows without a problem.

The problem will be finding a DBA who knows how to design and manage it properly so you get the performance characteristics you're looking for.

richardtallent
+1  A: 

As others have said, any decent DB can handle that sort of load. I've used MS SQL Server and PostgreSQL for databases of that size before, both work great. I'd recommend PostgreSQL because it's free and open. I've never done a performance comparison, but it seems to be very capable. I'd avoid DB2 or Oracle because they're very hard to use (unless you want to pay for a full time DBA, in which case such a person might be able to squeeze better performance out of those than any other solution, especially with Oracle).

rmeador
+1  A: 

I concur with richardtallent. The big-name database systems have all provided us with good tools for large databases. (2 million rows is nothing, although you can see performance issues with lousy indexes or poor choices in the SELECT statements, especially if you are joining across multiple tables of similar size.) It comes down to weighing the pros and cons: cost, usability, cost of support, etc.

I can speak most to Oracle and SQL Server. Oracle is pretty pricey, and it takes a pricey, dedicated DBA to really use it right. It isn't known for usability, but a DBA or programmer comfortable with it can work just fine in it. It also has great flexibility and some believe it is more powerful than the others. (I don't know if that's true or not, but I know it certainly provides lots of different ways you can tweak it for efficiency, etc.)

SQL Server can certainly handle large datasets just fine. It has a "prettier" face and tends to be considered more usable, though usability in the end is a matter of opinion. It has a cheaper price tag, but you might have a bit less flexibility than with Oracle. You can get a "cheap" SQL Server DBA, because its user-friendly interface makes it easy for people to do some of the basic DBA tasks without being experts. But you get what you pay for (usually), and if you really want efficiency and security, you pay for an expert anyway.

Those are just a few of the things to consider when looking at DBs. I'm sure MySQL and DB2 have their own pros and cons to be weighed.

But none of them have a problem with a measly 2 million rows. (I regularly work in a database with hundreds of tables, some of which have over 50 million rows, and I see little performance hit because the DBAs know what they are doing.)

FOLLOW-UP EDIT: Since this is for a website, maybe your biggest consideration should be front-end/back-end integration. For example, if you are using ASP for the web, SQL Server is a natural choice.

sql_mommy
Good point about the integration bit. Of course, you can use just about any combination of front-end and database, but the road will likely be bumpier :)
Thorarin
A: 

Properly configured, 2 million rows is not a big deal for most of the commercial DBs, and may not be for the open-source DBs either; I don't know enough about MySQL et al. to have an opinion.

By SQL I assume the original poster means MS SQL Server. While there were some scaling issues in the 2000 release, they seem to have been mostly addressed in 2005 and 2008. I have one test DB with significantly more than 2 million rows running now, and running quite well.

Respectfully, I think the question is badly stated; you need to provide much more information to get a useful answer: the size of the database, number of tables, number of common joins, whether it will be optimized for reads, writes or both, the number of concurrent users to be supported, replication, geographic location of end users vs. the database server, and the hardware configuration.

In general I have found that SQL Server post-2005 works very well in a lot of cases. If you need the ability to tune everything at the lowest level, both Oracle and DB2 give you better access and documentation to do that.

If your need is primarily a data warehouse and you have the cash, then I would look at Netezza or Teradata. I am a fan of Netezza, but we are partners so I am biased.

Hope that helps,

Terence
A: 

Keep in mind that if you have a large amount of data:

  • indexing columns you join tables on is ESPECIALLY important
  • writing efficient queries can make a huge difference
  • if you query data all the time and rarely write new rows, you can create clustered indexes and materialized views to retrieve data much more efficiently, based on which queries you use most often (a rough sketch follows below)
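
A rough sketch of those last two ideas, assuming a hypothetical page_views table (the clustered-index syntax is SQL Server style, and MATERIALIZED VIEW is Oracle/PostgreSQL syntax; SQL Server offers indexed views instead):

    -- Store rows physically in date order, which is what most of the read
    -- queries sort and range-filter on. (In MySQL/InnoDB the primary key
    -- is always the clustered index.)
    CREATE CLUSTERED INDEX ix_page_views_date ON page_views (view_date);

    -- Precompute an expensive aggregate once so the common report query
    -- becomes a cheap lookup instead of scanning millions of rows.
    CREATE MATERIALIZED VIEW daily_view_counts AS
    SELECT view_date, COUNT(*) AS views
    FROM page_views
    GROUP BY view_date;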
Jon
A: 

As a lot of people have already said, that number of records is not a problem if your database design is done properly.

But there may be another aspect worth considering. How many users, and in particular how many simultaneous users, do you expect your application to have? If you expect a lot of users, you also need to consider the scalability of the database engine, or of the database design.

MS SQL Server may not be expensive for a single-server setup, but if you need to scale up, e.g. run on 4 CPUs, the licensing becomes very expensive. And when you have pushed the limits of a single server and need to scale out to multiple servers, what do you do? I don't have the answer to that, except that, as far as I know, MS SQL Server does not directly support load balancing.

Just a thought

Pete