I'm trying to foresee how big my database will be. Let's say I have only one table:

CREATE TABLE user (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    email VARCHAR(50),
    pass  CHAR(40),
    url   VARCHAR(1000),
    PRIMARY KEY (id)
);

Adding it all up: 4 + 51 + 40 + 1001 = 1096 bytes in one record.
If I have 1 million records: 1,000,000 x 1,096 bytes ≈ 1,045 MB!

So, it is one tiny table and I'm looking at 1 gig to store it. Am I correct in my estimation?

A: 

Actually, a varchar field is stored differently from a char field: it only uses as much space as the data you actually put in it (plus a little overhead), rather than the declared maximum. The same is true of some other datatypes as well.

An easy way would be to add 100 records with random test data, then look at how big the database file is in your file system. Then add another hundred and see how much larger it grows.
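A minimal sketch of that in MySQL (the sample values are invented, and the information_schema query needs a reasonably recent MySQL; rerun with bigger batches and watch the numbers grow):

-- Seed a few rows with plausible field lengths (values are made up).
INSERT INTO user (email, pass, url) VALUES
    ('alice@example.com', SHA1('pw1'), 'http://example.com/some/fairly/long/path?x=1'),
    ('bob@example.com',   SHA1('pw2'), 'http://example.com/another/path'),
    ('carol@example.com', SHA1('pw3'), 'http://example.com/');

-- Ask MySQL how much space the table and its indexes occupy so far.
SELECT table_rows, data_length, index_length
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'user';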

mugafuga
Isn't VARCHAR potentially less than 1000 if you don't use it too?
kenny
Also remember that a varchar requires a few bytes to store the length of the field too, in addition to the data.
tmeisenh
+1  A: 

As the previous answer suggested, the varchar field makes things a little more difficult as it only uses enough storage for the string it contains in each row. After entering some sample data, a database such as MySQL (I assume others do this too) will be able to tell you the average size of each row.

Also, you need to keep in mind any indexes you create which will take up additional space. Again, using a range of sample data should give a rough indication to extrapolate the size estimates.
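For example, once a few hundred representative rows are in, MySQL will report this directly (a sketch; the figures are estimates, especially for InnoDB):

-- Rows, Avg_row_length, Data_length and Index_length are the columns of interest.
SHOW TABLE STATUS LIKE 'user';

-- Rough extrapolation (hypothetical): estimated size ≈ 1,000,000 * Avg_row_length,
-- plus a proportional scaling of Index_length.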

Edit: As many of the answers here suggest using sample data, please see my answer to an older question relating to this: PHP Script to populate MySQL tables

Jarod Elliott
Thanks! Gotta give it a try.
z-boss
+1  A: 

Actually, the space usage of MySQL's VARCHAR type is variable, based on the data put into it. The CHAR type has constant space usage. Aside from that, your computation looks correct: AFAIK, MySQL tables aren't stored on disk compressed, though you can explicitly compress them at the cost of making them read-only.
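A rough sketch of the worst case for this particular table, assuming a single-byte character set and ignoring per-row engine overhead (the VARCHAR length prefix is 1 byte for a declared maximum up to 255 characters, 2 bytes above that):

SELECT
    4                  AS id_bytes,              -- INT UNSIGNED: always 4 bytes
    50 + 1             AS email_max_bytes,       -- VARCHAR(50): data + 1-byte length prefix
    40                 AS pass_bytes,            -- CHAR(40): fixed width
    1000 + 2           AS url_max_bytes,         -- VARCHAR(1000): data + 2-byte length prefix
    4 + 51 + 40 + 1002 AS worst_case_row_bytes;  -- = 1097, an upper bound only

Real rows will usually be far smaller, because the VARCHAR columns only store what you put in them.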

hark
+2  A: 

Load some test data with the same average field lengths as production, then measure how much space it takes. That's what we do.

Don't bother loading 100 rows, just load 1M rows, or 10M from the start. Loading more rows into non-production systems is easy - it only takes a little longer.

It is really handy to have a large set of test data - that way you can accurately measure the disc space impact (and downtime impact) of database schema changes, for example adding indexes in particular.

Make sure your test data are at least as large as production, ideally bigger.
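One cheap way to get there (a sketch; it assumes you have already seeded a handful of realistic rows, as in the other answers) is to keep doubling the table by inserting a copy of itself - roughly twenty runs of the INSERT take a few seed rows past a million:

-- Each execution doubles the row count; email/pass/url keep their realistic lengths.
INSERT INTO user (email, pass, url)
SELECT email, pass, url FROM user;

-- Then check what it actually costs on disk (or simply look at the data files).
SELECT COUNT(*) FROM user;
SHOW TABLE STATUS LIKE 'user';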

MarkR
Good idea, but where to get such a set of test data?
z-boss
+2  A: 

Apart from the varchar issue you also need to be aware that most databases store the records in allocated blocks of storage (sometimes termed extents - although the precise terminology depends on the rdbms) which contain a certain amount of free space. The intention of this is to allow updates whilst minimizing table and index fragmentation. Of course allocated free space increases database file size even though there's no actual data in it.

These overheads can generally be specified and controlled when creating the table using rdbms-specific clauses, and even virtually eliminated if it's a read-only snapshot. OTOH you might want to make this padding larger than usual if your table is going to see a lot of IUD activity.
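As an illustration, Oracle exposes this directly with PCTFREE at table creation (a sketch only - the syntax and names are Oracle-specific, and MySQL's storage engines have no direct equivalent clause, managing page fill themselves):

-- Reserve 20% of each block for rows that grow when updated.
CREATE TABLE user_test (
    id    NUMBER        PRIMARY KEY,
    email VARCHAR2(50),
    pass  CHAR(40),
    url   VARCHAR2(1000)
) PCTFREE 20;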

A good rule of thumb is to calculate your expected table size as you do (guesstimating the varchar sizes as discussed in other posts, or better, running analysis on sample data) and then add 20%, a common default free space allocation. In practice it's actually unusual for free space allocation to cause an issue, especially if you deploy a sensible maintenance routine (so most folks never think about it), but failing to anticipate and make a suitable allocation for a table hit by unusually high IUD activity can give rise to tricky-to-trace performance problems.

To be honest, in these days when 600 GB disks are commonplace, it's a long time since I've seriously sized a database beyond a quick guesstimate.

*EDITED to reply to comment - "What is IUD and what do you mean by maintenance? Removing old records? – sneg"

IUD = Insert Update Delete activity. To illustrate the maintenance issue, let's consider what would happen if we created a database with no free space and loaded a table like the one you propose with records containing varchar data. All records would be placed in our database file end to end, with no space between them.

If the user then updates the varchar part of a record there are three possibilities. If the field is the same length there's no structural change. If it's shorter we overwrite the old field and there are a few spare bytes at the end of it - no biggie. If however it's longer then we have a problem - the record will no longer fit. In that case one solution is to copy the whole amended record to a new location and update the indexes (and in some management schemes drop a pointer in where the old record was). The issue is that a sequential read of the data - not an uncommon operation - now has to jump around the database file rather than read straight through - a classic fragmentation scenario - and performance gradually gets hosed.

By allocating free space to the table, we have a certain amount of elbow room when updating, which lets us change the record length without having to move the record from its page. Of course over time, if the table sees a lot of activity, it will still fragment (as we only allocate enough free space to cover some percentage of in-place record changes), which is where maintenance comes in.

Maintenance in this case is essentially a defragmentation process that moves records around and re-allocates free space so that they are distributed efficiently again. In some (most) RDBMSs you can just designate a maintenance plan and schedule a job to do this at a quiet time (SQL Server for instance), but in others you might have to do it manually - for example in older versions of Oracle the recommended approach was to export the data, drop and recreate the table, then re-import the data - the export/reload process would clean up the data as per any fresh load.
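In MySQL, for example, the maintenance step is typically just a periodic table rebuild (a sketch; exact behaviour depends on the storage engine):

-- Rebuilds the table and its indexes, repacking rows and reclaiming fragmented
-- free space (visible as Data_free in SHOW TABLE STATUS). For InnoDB this is
-- effectively a full table copy, so schedule it for a quiet period.
OPTIMIZE TABLE user;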

Index structures have similar issues.

I am of course glossing over a lot here, but the essential issues of storing variable length random access data records in a file will remain, no matter how many abstraction layers you pile on top of it. The good thing is that this sort of problem is well understood and most of the time it's not something you need to worry about - until you ask an apparently simple question like 'how much space will this table require' :-)

Cruachan
What is IUD and what do you mean by maintenance? Removing old records?
z-boss
Thanks a lot, Cruachan!
z-boss