views:

926

answers:

6

In a slightly heated discussion on TDWTF a question arose about the size of varchar columns in a DB.

For example, take a field that contains the name of a person (just the first name, no surname). It's quite easy to see that it will not be very long. Most people have names of fewer than 10 characters, and few exceed 20. If you made your column, say, varchar(50), it would definitely hold every name you would ever encounter.

However, for most DBMSs it makes no difference in size or speed whether you make a column varchar(50) or varchar(255).

So why do people try to make their columns as small as possible? I understand that in some cases you might indeed want to place a limit on the length of the string, but mostly that's not so. And a wider margin will only be beneficial in the rare case of a person with an extremely long name.


Added: People want references to the statement about "no difference in size or speed". OK. Here they are:

For MSSQL: http://msdn.microsoft.com/en-us/library/ms176089.aspx

The storage size is the actual length of data entered + 2 bytes.

For MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html

L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes

I cannot find documentation for Oracle and I have not worked with other DBMS. But I have no reason to believe it is any different there.
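A quick sketch of the MySQL rule just quoted: the stored size is the actual value length plus a 1- or 2-byte length prefix, and the declared maximum only decides which prefix is used. (A single-byte charset is assumed here for simplicity, so characters equal bytes; this is illustration, not MySQL's actual code.)

```python
def mysql_varchar_storage(value: str, declared_max: int) -> int:
    """Bytes used to store `value` in a VARCHAR(declared_max) column,
    per the L + 1 / L + 2 rule quoted above (single-byte charset)."""
    length = len(value.encode("latin-1"))
    if length > declared_max:
        raise ValueError("value exceeds declared column length")
    # 1-byte length prefix if the column can never exceed 255 bytes,
    # 2-byte prefix otherwise.
    prefix = 1 if declared_max <= 255 else 2
    return length + prefix

# 'SMITH' costs exactly the same in a varchar(50) and a varchar(255):
print(mysql_varchar_storage("SMITH", 50))   # 6
print(mysql_varchar_storage("SMITH", 255))  # 6
```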

+2  A: 

So why do people try to make their columns as small as possible? I don't believe in making them as small as possible, but sizing them appropriately. Some reasons for making (n)varchars smaller rather than larger:

1) With a larger field, all clients that use the database must be able to handle the full size. For example, take a system that holds a United States address with 255 characters for each field (similar to the TDWTF example you reference, I believe):

  • First Name
  • Last Name
  • Address Line 1
  • Address Line 2
  • City
  • State
  • ZIP Code

Now your data entry screens will need to allow and show 255 characters per field. Not hard, but unlikely to look nice with larger fields. When printing invoices, you will need line-breaking logic to handle the large fields. Depending on the tool, not that hard.

But I would not want the problem of formatting the address for an envelope when each of those fields, or just any one of them, could hold 255 characters. Are you going to truncate if the field is too long to fit? Great, someone has an Address Line 1 of "House Number Street Number ... blah blah blah ... Apartment number 111." and you'll lop off the important apartment number. Are you going to wrap? How much? What if you just can't fit it in the little box of space on the envelope? Raise an exception and have someone hand-letter it?
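The wrap-or-truncate dilemma above can be sketched in a few lines. The 40-character line width and two-line label budget are made-up numbers for illustration only:

```python
import textwrap

LINE_WIDTH = 40           # hypothetical envelope line width
MAX_LINES_PER_FIELD = 2   # hypothetical space budget on the label

def format_address_line(text: str) -> list[str]:
    """Wrap one address field for the label; refuse rather than
    silently truncate (which could drop the apartment number)."""
    wrapped = textwrap.wrap(text, width=LINE_WIDTH)
    if len(wrapped) > MAX_LINES_PER_FIELD:
        # No good automatic answer: flag for manual handling.
        raise ValueError("address line does not fit on the label")
    return wrapped

print(format_address_line("123 Main Street Apt 4B"))
```

Wrapping handles the common case, but once a 255-character value arrives, the only choices left are truncation, overflow, or a manual exception path, which is the answer's point.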

2) While 10 characters of data held in a varchar(50) versus a varchar(255) does not impact size or speed, allowing 255 characters leaves room for far more space to be taken. And if all fields are that large you could hit row size limits in SQL Server 2000. (I haven't read up on 2005 & 2008 to see if they can handle rows greater than one page.) And with Oracle, the larger sizes allow row chaining to happen if someone actually uses all the available characters.

3) Indexes have stricter size limits than leaf pages. You may preclude indexes, especially composite indexes, if you make your varchars too big.
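Point 3 can be made concrete with a back-of-envelope check, assuming the 900-byte index key limit documented for SQL Server of that era (the engine must plan for the declared maximums, not the actual data; creating such an index draws a warning, and inserts whose key actually exceeds the limit fail):

```python
INDEX_KEY_LIMIT = 900  # bytes; assumed SQL Server nonclustered key limit

def fits_in_index(column_max_bytes: list[int]) -> bool:
    """Can these columns form an index key in the worst case?"""
    return sum(column_max_bytes) <= INDEX_KEY_LIMIT

# Three right-sized varchar(255) columns still fit as a composite key...
print(fits_in_index([255, 255, 255]))   # True  (765 bytes)
# ...but the same columns declared varchar(4000) cannot.
print(fits_in_index([4000, 4000, 4000]))  # False
```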


On the other hand, I have a long line 1 for my address, and have been frustrated by web sites that don't allow the full thing to be typed.

Shannon Severance
1) If there really is a size limit (like the available space on an envelope), then it does make sense to put a limit there, as I already noted in my question. But I would rather put the limit in my client app than in the DB, because if the limit ever changed, I would then only have to change the client app, which in most cases is a far easier task than a DB schema update.
Vilx-
2) Row size is a fair point; however, since most real data will not approach these limits, it is quite unlikely that so big a row will ever be encountered. The large sizes are meant more as a fail-safe than for actual use.
Vilx-
3) Another point - but my thoughts are the same as in 2).
Vilx-
I fail to see how the excessive size on VARCHARs is a fail-safe, since it opens the possibility that someone will try stuffing in a record that meets all the data type constraints, but FAILS because the overall size is too large for the leaf node or index.
Shannon Severance
It is a fail-safe for the poor person who indeed has a long name or address. And I fail to grasp why someone would try to fill all the fields to their limit. A hacker, maybe? Well then, what do you care if the app throws an exception at him for it? Better for you, I'd say.
Vilx-
+4  A: 

I have heard the query optimizer does take varchar length into consideration, though I can't find a reference.

Defining a varchar length helps communicate intent. The more constraints defined, the more reliable the data.

Rob Elliott
I agree about the intent, but I do not see how it would make the data more reliable.
Vilx-
I should have said 'predictable' rather than reliable. The fewer values a variable can have, the fewer chances it has to do something I don't expect.
Rob Elliott
Unless you are programming in C/C++ or another language where you have to worry about buffer overflows, one string will be just the same as another in your code.
Vilx-
+1  A: 

I'll answer your question with a question: if there is no difference to the DBMS between a varchar(50) and a varchar(255), why would the DBMS let you make a distinction? Why wouldn't a DBMS simply say "use varchar for up to xxx characters, and text/clob/etc. for anything over that"? Sure, perhaps Microsoft/Oracle/IBM might keep the length definition for historical reasons, but what about DBMSs like MySQL, which has multiple storage backends: why does every one of them implement definable character column lengths?

Dan
Because sometimes you do need that length constraint? Although it would indeed then be better as a /constraint/ not a column datatype parameter. OK, I don't know. :)
Vilx-
Right. So aside from backwards compatibility there is no "on the surface" reason for the existence of such a feature. That suggests there is a deeply buried reason for it. I suppose it's possible that other DBMSs would do it for a "me too" reason, but I am of the belief that developers aren't adding features for the sake of adding features. At least, I don't. :)
Dan
Or maybe it is a historical feature from the 80s, when databases had more primitive storage optimizers and every little bit helped.
Pavel Minaev
+4  A: 

I can only speak for Oracle. A VARCHAR2(50) and a VARCHAR2(255) take up exactly the same amount of space and perform identically, if you enter the value 'SMITH'.

However, the reason why it is generally not a good idea to go around declaring all your textual columns as VARCHAR2(4000) is that column length is, effectively, another constraint. And constraints are the database's implementation of business rules, so they are definitely something that should be defined on the database side of things.

As an example: you define a CHECK constraint on a column so that the only values it can accept are 'Y' and 'N'. That saves your application from having to deal with 'y' and 'n', or even '1' and '0'. The check constraint ensures your data conforms to expected standards. Your application code can then make valid assumptions about the nature of the data it has to deal with.

Column length definition is in the same boat. You declare something to be a VARCHAR2(10) because you don't want it accepting an entry of 'ABC123ZYX456' (for whatever reason!)

In Australia, I define STATE columns to be a varchar2(3) because I don't want people typing in 'New South Wales' or 'South Australia'. The column definition pretty much forces them to be entered as 'NSW' and 'SA'. In that sense, a VARCHAR2(3) is almost as much a check constraint as actually specifying a CHECK IN ('NSW','SA','VIC' etc) constraint.
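The "length as constraint" point can be demonstrated with a small script. SQLite is used here purely for illustration; since SQLite does not enforce declared VARCHAR lengths, an explicit CHECK on length stands in for the VARCHAR2(3) limit the answer describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A CHECK on length emulates VARCHAR2(3): only 3-character-or-shorter
# state codes are accepted.
conn.execute("""
    CREATE TABLE address (
        state TEXT CHECK (length(state) <= 3)
    )
""")

conn.execute("INSERT INTO address VALUES ('NSW')")  # accepted

try:
    conn.execute("INSERT INTO address VALUES ('New South Wales')")
except sqlite3.IntegrityError:
    print("rejected: 'New South Wales' exceeds 3 characters")
```

Either way the rule is enforced in the database, which is exactly the answer's argument: the length limit and the CHECK constraint are the same kind of thing.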

In short, proper column lengths are a way of encoding business rules. They're another form of constraint. They bring all the advantages of constraints (and suffer from many of the same drawbacks). And they ensure, to a small extent, a degree of 'data cleanliness' that "proper" constraints help with, too.

I don't buy the argument, either, that it's best to stick these sorts of things in the client app because it's easier to change there. You have 20,000 people using an app, that's 20,000 updates. You have one database, that's one update. The 'easier to change the client app' argument, if true, would potentially mean the database just gets treated as a giant bit bucket with all the clever logic being handled in client code. It's a big discussion to have, but since all RDBMSes let you define constraints and so on in the database itself, it's pretty clear that there's at least a worthwhile case to be made that such fundamental logic belongs in the backend.

It's a good answer, but it just leads to another question: so, then, what is a reasonable length constraint for a name? and is there even one?
Pavel Minaev
Upvoted for a good, detailed discussion! :) Still, there is nothing here that I didn't say in my OP. Yes, there are cases when it makes sense to put a limit on a string, like the hash of a password or the three-letter code of a state. But in the vast majority of cases varchar fields are for names and descriptions, where there is no obvious limit.
Vilx-
A: 

One important distinction is between specifying an arbitrarily large limit [e.g. VARCHAR(2000)], and using a datatype that does not require a limit [e.g. VARCHAR(MAX) or TEXT].

PostgreSQL bases all its fixed-length VARCHARs on its unlimited TEXT type, and dynamically decides per value how to store it, including storing it out-of-page. The length specifier in this case really is just a constraint, and its use is actually discouraged. [1]

Other DBMSs require the user to select if they require "unlimitted", out-of-page, storage, usually with an associated cost in convenience and/or performance.

If there is an advantage in using VARCHAR(<n>) over VARCHAR(MAX) or TEXT, it follows that you must select a value for <n> when designing your tables. Assuming there is some maximum width of a table row, or index entry, the following constraints must apply:

  1. <n> must be less than or equal to <max width>
  2. if <n> = <max width>, the table/index can have only 1 column
  3. in general, the table/index can only have <x> columns where (on average) <n> = <max width> / <x>

It is therefore not the case that the value of <n> acts only as a constraint, and the choice of <n> must be part of the design. (Even if there is no hard limit in your DBMS, there may well be performance reasons to keep the width within a certain limit.)
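As a rough illustration of rule 3, take SQL Server's approximately 8060-byte in-row limit as the assumed <max width> (per-row and per-column overhead are ignored here for simplicity):

```python
MAX_ROW_BYTES = 8060  # assumed <max width>: SQL Server in-row data limit

def max_varchar_columns(n: int) -> int:
    """How many VARCHAR(n) columns fit in one row if all are full."""
    return MAX_ROW_BYTES // n

# Right-sized columns leave room for a realistic table design;
# blanket VARCHAR(4000) does not.
print(max_varchar_columns(50))    # 161
print(max_varchar_columns(4000))  # 2
```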

You could use the above rules to assign a maximum value of <n>, based on the expected architecture of your table (taking into account the impact of future changes). However, it makes more sense to define the minimum value of <n>, based on the expected data in each column. Most likely, you will round up to the nearest "round number", e.g. you will always use either VARCHAR(10), VARCHAR(50), VARCHAR(200), or VARCHAR(1000), whichever is the best fit.

IMSoP
+1  A: 

The simple answer, in my opinion, is the fact that you cannot use a varchar(max) column as an index key; if you require any indexing on it, you are basically forced to use full-text. In any case, 'right-sizing' columns makes a lot of sense whenever you [may] want to apply any indexing; updating variable-length columns can be a costly maneuver, as these updates are not done in place and can/will cause some amount of fragmentation.

All with regard to MS SQL Server.

Hmm... looks like you're right. I had missed the part about indexes on varchar(max).
Vilx-