views: 968
answers: 8

OK, dumb question, I know, but I see the nebulous comment 'a large database', as well as small and medium, and I wonder just what that means. Can someone define what a small, medium, and large database is for us SQL neophytes?

+13  A: 

One way to figure it is by observing your test queries.

A small database is one where indexes don't matter.

A medium database is one where queries take longer than one second if you don't have an appropriate index in place.

A big database is one where queries often take hours to optimize, using a combination of query design, index modification, and many test cycles.
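
As a rough illustration of that second point (the orders table and its columns here are hypothetical, and actual timings depend on hardware and data): a query that has to scan a few million rows will often cross the one-second mark until a matching index exists.

    -- Without an index on customer_id, this query forces a full table scan
    -- and can easily take longer than one second on a "medium" database.
    SELECT order_id, total
    FROM   orders
    WHERE  customer_id = 42
    ORDER  BY placed_at DESC;

    -- A matching index typically brings the same query down to milliseconds.
    CREATE INDEX idx_orders_customer_placed
        ON orders (customer_id, placed_at);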

le dorfier
+1. This is an interesting definition
Mitch Wheat
@le dorfier: BTW I believe you were right about the atomic update with max select (although I still wouldn't do it that way)
Mitch Wheat
+30  A: 

There isn't a threshold where a small database becomes medium or a medium database becomes large. Generally, when I hear these terms, I think of particular orders of magnitude in terms of total records being stored.

  • Small: 10^5 or fewer records.
  • Medium: 10^5 to 10^7 records.
  • Large: 10^7 to 10^9 records.
  • Very large: more than 10^9 records.
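
If you want to see roughly where your own database sits on this scale, one quick way is to list approximate row counts per table. The sketch below is PostgreSQL-specific (other engines have equivalent catalog views), and the counts are estimates, which is close enough for an order-of-magnitude check:

    -- PostgreSQL: approximate live row counts per table, largest first.
    SELECT relname    AS table_name,
           n_live_tup AS approx_rows
    FROM   pg_stat_user_tables
    ORDER  BY n_live_tup DESC;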

As poster le dorfier suggested, you could also think about it in terms of the properties each kind of database has. Categorizing it this way, I'd say:

  • Small: Performance is not a concern. Your queries run fine without making any special optimizations. You see only a marginal performance difference when using front-line enhancements like indexes.

  • Medium: Your database probably has one or more staff members assigned part-time to its maintenance and care. These people pay attention to the database's health; their primary administrative responsibility is to prevent unacceptable performance problems and minimize downtime.

  • Large: Probably has dedicated staff member(s) whose job is to work on the database and improve performance, as well as make sure that application changes don't cause schema breakage over the lifetime of the database. Metrics about the health and status of the database are monitored closely. Significant expertise is required to understand and perform optimizations.

  • Very large: The database stores vast amounts of information that must be readily accessible. Performance optimizations are absolutely required to wring every last ounce of speed out of every query, and without them, the database would be much less usable or even impossible to use. The database may be using sophisticated or innovative replication or clustering techniques, pushing the boundaries of current technology.

Note that these are entirely subjective, and that someone may very well have a perfectly legitimate alternate definition of "large".

John Feminella
Superb answer, almost exactly what I would have said which is interesting considering the subjectivity and moving goalposts.
Peter Wone
Excellent answer John. Very concise. I tried to explain the same but went on a different and more complex route :S
vmarquez
I like the second part of the answer but the first part, relating size to number of records, is a bit misleading I think. You could have a really simple table with tons of records, or a small number of records but a very complicated organization of tables.
Outlaw Programmer
Actually, I'd say that either of your two examples could well qualify as large. Are you suggesting that an enormous property-key dictionary consisting of a single table with 50 million records is in actuality a "small database"?
John Feminella
I'd say it's also legitimate to consider the converse: an enormously complex schema structure consisting of 10,000 tables, but which contains only 5 rows in total. Is this a "large database"?
John Feminella
I might consider the case of 1 table with 50m records a "large" DB, but you probably don't need a dedicated DBA to manage it. I felt that you're trying to couple total records with # of DBAs, which may not be the case with simple databases that have only a few tables but many records.
Outlaw Programmer
Sure, but that's why these are general categories, not hard and fast rules. As noted, this is an entirely subjective categorization.
John Feminella
Interesting analysis!
Kyle Kaitan
+1  A: 

According to the Wikipedia article on Very Large Database:

A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.

karlcow
A: 

I think something like Wikipedia or the US census data is a 'big' database. My personal address list or todo list is a small database. A middle-sized database is something in between.

You could try to define the sizes by how many servers you need. A small database is a component of an application you run on your desktop, a mid-sized database would be a single MySQL (or whatever) server somewhere, and a large database is going to require multiple servers with some kind of replication/failover support.

Zoredache
+1  A: 

Best answer, hands down: large databases are the ones that force you to stop using relational databases.

In other words, a normalized, relational database where all the indexes in the world can't help you meet your response time requirements because of the massive JOINs.

If you've ever had to abandon relational databases for something else, you're either a poor database developer, have no expert DBA, or have a very large database.

Chris
+2  A: 

“Large database” is indeed a nebulous concept. Very different answers and opinions have already been posted to this question. Some approaches to defining “small”, “medium”, and “large” databases may make more sense than others, but in the end I consider each definition right, true, and valid.

Some definitions make more sense than others because they focus on different aspects that are important for the design, programming, use, maintenance, and administration of a database, and those aspects are what really matter for a usable database. It just happens that all of them are affected by the nebulous concept of “database size”.

So, does this mean it does not matter whether you can tell if a particular database is big or not?

Certainly not. What it means is that you will apply the concept differently when evaluating different design, operational, and administrative aspects of your database. It also means that the concept will remain nebulous every time you do.

As an example: index strategy (an aspect of database design) is impacted by the record count for each table (one measure of “size”), by record size times record count (another measure of “size”), and by the ratio of query to create/update/delete operations (an aspect of database usage).

Query response times are better if indexes are used on tables with a large number of records. Depending on the nature of your WHERE, ORDER BY, and aggregation clauses, you may need several indexes on certain tables.
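
For example (table and column names invented for illustration), a single composite index whose column order matches both the filter and the sort can serve a query like this:

    -- One index can cover both the WHERE filter and the ORDER BY sort
    -- when its column order matches the query.
    CREATE INDEX idx_invoices_status_date
        ON invoices (status, invoice_date);

    SELECT invoice_id, amount
    FROM   invoices
    WHERE  status = 'OPEN'
    ORDER  BY invoice_date;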

Create, update, and delete operations are impacted negatively as the number of indexes on the affected table(s) grows. More indexes on a table mean more changes the RDBMS must perform, spending more time and more resources to apply them.

Also, if your RDBMS spends more time applying those changes, then locks are held for longer, impacting the response times of other queries being sent to the system at the same time.

So, how do you balance the quantity and design of your indexes? How do you know whether you need an additional index, and whether adding it will not introduce a big negative impact on query response times? Answer: you test and profile your database against a target load, as per your load/performance requirements, and analyze the profiling data to discover whether further optimizations, redesigns, or indexes are needed.
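
As a small sketch of that test-and-profile loop (PostgreSQL syntax, hypothetical table), EXPLAIN ANALYZE shows whether a candidate index is actually used and what it costs:

    -- Run the slow query and capture the actual plan and timings.
    EXPLAIN ANALYZE
    SELECT *
    FROM   invoices
    WHERE  customer_id = 42
      AND  invoice_date >= DATE '2009-01-01';

    -- Add a candidate index, re-run the EXPLAIN ANALYZE above, and compare:
    -- did the plan switch from a sequential scan to an index scan, and did
    -- the total time drop enough to justify the extra write cost?
    CREATE INDEX idx_invoices_customer_date
        ON invoices (customer_id, invoice_date);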

Different index strategies are required for different ratios of query to create/update/delete operations. If your database is under a heavy query load but is rarely updated, overall application performance will be better if you add every index that improves query response times. On the other hand, if your database is constantly being updated but sees relatively few queries, performance will be better if you use fewer indexes.

There are other aspects, of course: database schema design, storage strategy, network design, backup strategy, stored procedure/trigger programming, application programming (against the database), and so on. All these aspects are impacted differently by distinct concepts of “size” (record size, record count, index size, index count, schema design, storage size, etc.).

I'd like to have more time, as this topic is fascinating. I hope this small contribution serves as a starting point for you in the fascinating world of SQL.

vmarquez
A: 

You have to account for hardware advancement for this definition:

  1. Small database: working set fits into the physical RAM of a single commodity server (about 16GB now)

  2. Medium database: fits into a single or several (through RAID) commodity hard drives on a single machine (up to several TBs now)

  3. Large database: data needs to be distributed across multiple commodity servers in order to fit (up to several PBs now).
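
To place a concrete database against these buckets, you can check how much storage it currently occupies; the one-liner below is PostgreSQL-specific (other engines expose similar information):

    -- PostgreSQL: total on-disk size of the current database,
    -- formatted in human-readable units.
    SELECT pg_size_pretty(pg_database_size(current_database()));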

obecalp
A: 

If you have a database that is large enough that you can't just "back it up" to put on a development or test box, you likely have a "large database".

pearcewg