I have a contacts table which contains fields such as postcode, first name, last name, town, country, phone number, etc., all of which are defined as VARCHAR(255) even though none of these fields will ever come close to having 255 characters. (If you're wondering, it's this way because Ruby on Rails migrations map String fields to VARCHAR(255) by default and I never bothered to override it.)

Since a VARCHAR only stores the actual characters of the value (plus a length prefix), is there any distinct advantage (performance or otherwise) to using, say, VARCHAR(16) over VARCHAR(255)?

Additionally, most of these fields have indexes on them. Does a larger VARCHAR size on the field affect the size or performance of the index at all?

FYI I'm using MySQL 5.
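
For concreteness, the kinds of change I'm weighing look like this (just a sketch; the lengths are arbitrary, and contacts/postcode are from my schema):

-- Option 1: narrow the column itself
ALTER TABLE contacts MODIFY postcode VARCHAR(16);

-- Option 2: leave the column at 255 but index only a leading prefix
CREATE INDEX idx_contacts_postcode ON contacts (postcode(16));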

+1  A: 

It's good practice to allocate only a little over what you need. Phone numbers would never go this large.

One reason is that unless you validate against long entries, no doubt someone will use all the space there is. Then you might run out of room in your row. I'm not sure of the MySQL limit, but 8060 bytes is the maximum row size in MS SQL Server.

A more sensible default would be 50, imho, and then increase it where a proven need arises.

dove
Thanks. I definitely agree about it being good practice. It's the performance aspect I'd really like clarification on.
Olly
+4  A: 

I'm with you. Fussy attention to detail is a pain in the neck and has limited value.

Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.

If you use only CHAR fields, you get fixed-length rows. This can save some disk real estate if you pick accurate sizes for your fields. You might get more densely packed data (fewer I/Os for table scans) and faster updates (it's easier to locate open spaces in a block for updates and inserts).

However, if you over-estimate your sizes, or your actual data sizes are variable, you'll waste space with CHAR fields, and the data will wind up less densely packed (leading to more I/Os for big retrievals).

Generally, the performance benefit from putting a size on variable fields is minor. You can easily benchmark VARCHAR(255) against CHAR(x) to see whether you can measure any difference.
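
For example, a minimal benchmark sketch in MySQL (the table names are made up, and you'd load both tables with identical data first):

CREATE TABLE t_char    (name CHAR(255)    NOT NULL);
CREATE TABLE t_varchar (name VARCHAR(255) NOT NULL);

-- ... load the same rows into both tables ...

-- Compare on-disk size:
SHOW TABLE STATUS LIKE 't_char';
SHOW TABLE STATUS LIKE 't_varchar';

-- Compare full-scan time (SQL_NO_CACHE bypasses the query cache):
SELECT SQL_NO_CACHE COUNT(*) FROM t_char    WHERE name LIKE '%zzz%';
SELECT SQL_NO_CACHE COUNT(*) FROM t_varchar WHERE name LIKE '%zzz%';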

However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.

S.Lott
+8  A: 

Nowadays, I can't imagine it really matters any more.

There's a computational overhead to using variable-length fields, but with the excess CPU capacity available today, it's not even worth considering. The I/O system is so slow that any computational cost of handling varchars is effectively non-existent. In fact, the computational price of a varchar is probably a net win given the disk space saved by variable-length fields over fixed-length ones: you most likely get greater row density.

Now, the complication with varchar fields is that you can't easily locate a record via its record number. When you have a fixed row size (with fixed-length fields), it's trivial to compute the disk block that a row id points to. With a variable row size, that kind of goes out the window.

So now you need to maintain some kind of record-number index, just like any other primary key, OR you need to create a robust row identifier that encodes details (such as the block, etc.) into the identifier. If you do that, though, the id has to be recalculated whenever the row is moved on persistent storage. No big deal: you just need to rewrite all of the index entries and make sure that you either a) never expose it to the consumer or b) never assert that the number is reliable.

But since we have varchar fields today, the only value of varchar(16) over varchar(255) is that the DB will enforce the 16-character limit on the varchar(16). If the DB model is supposed to be actually representative of the physical data model, then field lengths can be of value. If, however, it's simply "storage" rather than "model AND storage", there's no need whatsoever.
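
A quick sketch of that enforcement in MySQL (the names are made up, and the exact behavior depends on the server's sql_mode):

CREATE TABLE t (code VARCHAR(16));

-- In strict mode this INSERT fails with an error; in non-strict mode
-- MySQL truncates the value to 16 characters and issues a warning:
INSERT INTO t (code) VALUES ('considerably longer than sixteen');
SHOW WARNINGS;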

Then you simply need to distinguish between a text field that is indexable (such as varchar) and something that is not (like a TEXT or CLOB field). The indexable fields tend to have a size limit to facilitate the index, whereas the CLOB fields do not (within reason).

Will Hartung
A: 

It's about far more than just disk space. Consider the way your application handles memory paging and how frequently you need to swap; these are big things to consider.

Let's think of an example where you're running on Windows. A typical workstation build of Windows uses 4 KB memory pages. If you're using varchar(255) to store a phone number, you can hold 16 phone numbers in that page (4096 / 255), with 16 bytes of overhang. If you're using varchar(50), you're at 81 numbers with 46 bytes of overhang (4096 / 50). If you use 16 bytes to store a phone number (reasonable even with markup, such as (123) 456-7890, which is 14 characters), you can store a whopping 256 phone numbers in a single page of memory.

Now consider a different problem: zip codes. Supposing you store zip codes as varchar, if you use varchar(255) you're still at your 16 limit for a single page. But let's say I have this table:

CREATE TABLE Addresses
(
    AddressID int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    AddressLine1 varchar(40) NOT NULL,
    AddressLine2 varchar(40) NOT NULL DEFAULT(''),
    City varchar(25) NOT NULL,
    State varchar(6) NOT NULL,
    ZipCode varchar(10) NOT NULL
)

Now, I want to query the database for all of the users in zip code 85282. Suppose there are a million rows: at 16 rows per memory page, that's potentially up to 62,500 page faults that need to occur in order to read every row and check the zip code. Alternatively, using my suggested field size of 10 here (which accommodates zip+4 in the US), you can fit 409 zip codes per page, resulting in 2,445 page faults max - a factor of 25 reduction!
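
(For reference, the query in question would be something like the following; with no index on ZipCode it's a full table scan, which is the case the arithmetic above assumes.)

SELECT AddressID, AddressLine1, City, State
FROM Addresses
WHERE ZipCode = '85282';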

When you consider the types of scalability we have, and the need to architect for performance, paging - while something we typically let the OS handle - is still a major factor since the disk is so incredibly slow compared to the physical memory supported by the system. The answer isn't to simply throw more memory at it - it's to be careful about how we build a system in the first place.

Rob
This is complete misinformation. What are you talking about?
Seun Osewa
The whole point of VARCHARs is they only consume enough space for their actual value. Perhaps you're thinking of CHAR?
bdonlan
+21  A: 

In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255) which would always store 255 characters.

But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: when your query implicitly generates a temporary table, for instance while sorting or GROUP BY, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.
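
You can check whether a particular query does this. A sketch, with table and column names assumed from the question:

EXPLAIN
SELECT country, COUNT(*)
FROM contacts
GROUP BY country;

-- If the Extra column of the EXPLAIN output says "Using temporary",
-- the query builds an implicit temporary table, and any VARCHAR(255)
-- it carries is widened to a fixed 255 characters in that table.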

It's best to define the column based on the type of data that you intend to store. It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it may have been the maximum length of a VARCHAR in some databases in the dawn of time (as well as PostgreSQL until more recently).

Bill Karwin
+4  A: 

In my experience, if you allow a datatype of 255 characters, some stupid user (or some experienced tester) will actually fill that up.

Then you have all sorts of problems, including how much space you allow for those fields in reports and on-screen displays in your application. Not to mention the possibility of exceeding the per-row limit for data in your database (if you had more than a few of these 255-character fields).

Much easier to pick a reasonable limit at the beginning, then enforce that through the application and database.

BradC
+4  A: 

In addition to the size and performance considerations of setting the size of a varchar (and possibly more important than those, as storage and processing get cheaper every second), the disadvantage of using varchar(255) "just because" is reduced data integrity.

Defining maximum lengths for strings is a good way to prevent unexpectedly long strings from entering the RDBMS and causing buffer overruns or exceptions/errors later, when values longer (in bytes) than expected are retrieved and parsed.

For example, if you have a field that accepts two-character strings for country abbreviations, then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.

Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).
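
A sketch of what that looks like in MySQL (the table name is illustrative):

CREATE TABLE country_example (country_code CHAR(2) NOT NULL);

-- In strict mode this fails outright; in non-strict mode MySQL
-- truncates it to 'An' with a warning - which is itself a wrong code:
INSERT INTO country_example (country_code) VALUES ('Antigua and Barbuda');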

Examples of real data that has no reason to exceed a certain length:

  • Canadian Postal Codes are of the format A1A1A1 and are always 6 characters in length, even for Santa Claus (6 characters excludes the space that can be specified for legibility).
  • Email addresses can be up to 64 bytes before the @ and up to 255 bytes after. Never more, lest you break the Internet.
  • North American Phone Numbers are never more than 10 digits (excluding the country code).
  • Computers running (recent versions of) Windows cannot have computer names longer than 63 bytes, though more than 15 is not recommended and will break your Windows NT server farm.
  • State abbreviations are 2 characters (like the country codes in the example above)
  • UPS tracking numbers are either 18-, 12-, 11-, or 9-characters long. The 18-character numbers start with "1Z" and the 11-character numbers start with "T" which makes you wonder how they deliver all those packages if they don't know the difference between letters and numbers.

And so on...
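
To make the list above concrete, a DDL sketch (the column names and exact types are illustrative, not prescriptive):

CREATE TABLE sized_fields_example (
    ca_postal_code  CHAR(6),       -- A1A1A1, space stripped
    email_address   VARCHAR(320),  -- 64 + 1 for the '@' + 255
    na_phone_number CHAR(10),      -- digits only, no country code
    computer_name   VARCHAR(63),
    state_abbrev    CHAR(2),
    ups_tracking    VARCHAR(18)    -- the longest of the listed formats
);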

Take the time to think about your data and its limits. If you're an architect, developer, or programmer, it's your job, after all.

By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.

And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.

shufler
Canadian postal codes are actually 7 characters; the space in the middle is important and should be shown on mailing labels. North American phone numbers may have more than 10 digits if there is an extension. If you are OK with not being able to store phone number extensions, then 10 digits is fine, but you will probably regret it.
Kibbee
There is definitely a case for being restrictive for the sake of data integrity, though it's still easy to be too restrictive. Impose restrictions for data you control, and impose _sane_ restrictions for data requirements you can't control. Your phone number and e-mail restrictions are sane (assuming you never internationalize). Your claim that truncating to a two-character country code is the "graceful" thing is insane: you know there was an error, so don't truncate and accept. If you truncate, there is an extremely high probability you'll end up with an incorrect country code.
coderjoe