It is frequently advised to choose database field sizes to be as narrow as possible. I am wondering to what degree this applies to SQL Server 2005 VARCHAR columns: Storing 10-letter English words in a VARCHAR(255) field will not take up more storage than in a VARCHAR(10) field.

Are there other reasons to restrict the size of VARCHAR fields to stick as closely as possible to the size of the data? I'm thinking of

  • Performance: Is there an advantage to using a smaller n when selecting, filtering and sorting on the data?
  • Memory, including on the application side (C++)?
  • Style/validation: How important do you consider restricting column size to force nonsensical data imports to fail (such as 200-character surnames)?
  • Anything else?

Background: I help data integrators with the design of data flows into a database-backed system. They have to use an API that restricts their choice of data types. For character data, only VARCHAR(n) with n <= 255 is available; CHAR, NCHAR, NVARCHAR and TEXT are not. We're trying to lay down some "good practices" rules, and the question has come up whether there is a real detriment to using VARCHAR(255) even for data where real maximum sizes will never exceed 30 bytes or so.

Typical data volumes for one table are 1-10 million records with up to 150 attributes. Query performance (SELECT, with frequently extensive WHERE clauses) and application-side retrieval performance are paramount.

+3  A: 

I think that the biggest issue is data validation. If you allow 255 characters for a surname, you WILL get a surname that's 200+ characters in your database.

Another reason is that if you allow the database to hold 255 characters you now have to account for that possibility in every system that touches your database. For example, if you exported to a fixed-width column file all of your columns would have to be 255 characters wide, which could be pretty annoying or even problematic. That's just one example where it could cause a problem.

Tom H.
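
The fixed-width export example above can be made concrete with a little arithmetic. This is a minimal sketch, assuming one byte per character and no delimiters; the function name is hypothetical, and the 150-attribute and 30-byte figures are taken from the question.

```python
# Hypothetical illustration: record width of a fixed-width export file,
# where each column is padded to its declared VARCHAR(n) maximum.
# Assumes one byte per character and no delimiters between columns.

def fixed_width_record_bytes(declared_widths):
    """Bytes per record when every column is padded to its declared width."""
    return sum(declared_widths)

ATTRIBUTES = 150  # upper bound on attributes per table, from the question

all_255 = fixed_width_record_bytes([255] * ATTRIBUTES)
all_30 = fixed_width_record_bytes([30] * ATTRIBUTES)

print(all_255)  # 38250
print(all_30)   # 4500
```

Under these assumptions, a record padded to VARCHAR(255) everywhere is 8.5 times as wide as one padded to 30 bytes per column.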
I'm also thinking that the overall row size might be an issue. Even though you're allowed to exceed 8KB rows in SQL Server 2005, it's done in a slightly fiddly way which may impact performance -- the pages themselves are still 8K, from what I remember. So, if you use big columns everywhere, you're in more danger of exceeding the 8K page. Anyone else know more about the performance implications?
Matt Gibson
Yes, if you exceed the 8060 byte limit then the largest variable-length column will get moved to another location with a pointer added to the original page record. This will only happen if the true size (not the declared size) of the variable columns pushes it over the 8060 bytes. If you allocate VARCHAR(8000) for several columns, but they are all only holding a few characters of data then you won't have that problem. It can potentially happen though.
Tom H.
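
The distinction in the comment above between true size and declared size can be sketched as follows. This is a simplified model, not SQL Server's actual storage logic: it only sums the bytes of the stored values and ignores per-row and per-column overhead. The function name and sample data are hypothetical.

```python
# Simplified model: only the summed byte lengths of the stored values count.
# Real rows also carry per-row and per-column overhead, which is ignored here.

MAX_IN_ROW_BYTES = 8060  # in-row limit cited in the comment above

def exceeds_in_row_limit(values, limit=MAX_IN_ROW_BYTES):
    """True if the actual stored bytes of the values exceed the in-row limit."""
    actual = sum(len(v.encode("latin-1")) for v in values)
    return actual > limit

# Five columns declared VARCHAR(8000), each holding only a few characters.
short_row = ["abc"] * 5
# The same five columns, each holding 2000 characters of data.
long_row = ["x" * 2000] * 5

print(exceeds_in_row_limit(short_row))  # False
print(exceeds_in_row_limit(long_row))   # True
```

The declared widths are identical in both cases; only the second row's actual data crosses the limit.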
Many thanks -- you were the first to discuss row length, which was the crucial bit of information I was missing. I marked Thomas's answer as "correct" because it's easier to read for future readers.
chryss
A: 

One good reason is validation.

For example, in Holland a social security number is always 9 characters long. If the column does not allow more, a longer value can never occur.

If you allow more and, for some unknown reason, a 10-character value turns up, you will need to add checks (which you otherwise wouldn't need) to verify that the value is 9 characters long.

rdkleine
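
The kind of extra check described in the answer above might look like the sketch below. The function, column name, and sample rows are hypothetical; the point is only that with an over-wide column this check has to live in application code, whereas a column declared with the right width rejects the value on its own.

```python
# Hypothetical application-side check that becomes necessary when the
# column itself is declared wider than the data is supposed to be.

def oversized_values(rows, column, max_len):
    """Return the values of `column` that are longer than max_len characters."""
    return [row[column] for row in rows if len(row[column]) > max_len]

rows = [
    {"id_number": "123456789"},
    {"id_number": "1234567890"},  # one character too many
]

bad = oversized_values(rows, "id_number", 9)
print(bad)  # ['1234567890']
```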
+10  A: 
  1. Data Integrity - By far the most important reason. If you create a column called Surname that is 255 characters, you will likely get more than surnames. You'll get first name, last name, middle name. You'll get their favorite pet. You'll get "Alice in the Accounting Department with the Triangle hair". In short, you will make it easy for users to use the column as a notes/surname column. You want the cap to impede the users that try to put something other than a surname into that column. If you have a column that calls for a specific length (e.g. a US tax identifier is nine characters) but the column is varchar(255), other developers will wonder what is going on and you will likely get crap data as well.

  2. Indexing and row limits. In SQL Server you have a row size limit of 8060 bytes IIRC. Lots of fat non-varchar(max) columns with lots of data can quickly exceed that limit. In addition, index keys have a 900-byte cap in width IIRC. So, if you wanted to index on your surname column and some others that contain lots of data, you could exceed this limit.

  3. Reporting and external systems. As a report designer you must assume that if a column is declared with a max length of 255, it could have 255 characters. If the user can do it, they will do it. Thus, to say, "It probably won't have more than 30 characters." is not even remotely the same as "It cannot have more than 30 characters." Never rely on the former. As a report designer, you have to work around the possibilities that users will enter a bunch of data into a column. That either means truncating the values (and if that is the case why have the additional space available?) or using CanGrow to make a lovely mess of a report. Either way, you make it harder on other developers to understand the intent of the column if the column size is so far out of whack with the actual data being stored.

Thomas
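
Item 2 of the answer above can be illustrated with a quick calculation. This is a rough sketch, assuming one byte per character and using the 900-byte figure as quoted in the answer; the function names and column widths are hypothetical.

```python
# Rough check of whether the declared widths of the indexed columns
# could add up to more than the index key limit quoted in the answer.
# Assumes one byte per character.

MAX_INDEX_KEY_BYTES = 900  # index key cap as quoted in the answer

def declared_key_bytes(column_widths):
    """Maximum key size if every indexed column is filled to its declared width."""
    return sum(column_widths)

def may_exceed_key_limit(column_widths, limit=MAX_INDEX_KEY_BYTES):
    """True if the declared widths allow a key larger than the limit."""
    return declared_key_bytes(column_widths) > limit

wide = [255, 255, 255, 255]  # four VARCHAR(255) columns
narrow = [30, 30, 30, 30]    # the same four columns sized to the data

print(declared_key_bytes(wide))      # 1020
print(may_exceed_key_limit(wide))    # True
print(may_exceed_key_limit(narrow))  # False
```

With columns sized to the data, a four-column index stays well under the cap no matter what gets entered.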
+1 for item number two.
anonymous coward
very well said.
HLGEM
A: 

The other thing is that a single row of data is limited to 8060 bytes, and SQL Server uses the max length of varchar fields to determine this.

Reference: http://msdn.microsoft.com/en-us/library/ms143432.aspx

Jeremy
@Jeremy - I think you mean 8060.
Thomas
yup fixed. thanks!
Jeremy
A: 

1) Readability & Support

A database developer could look at a field called StateCode with a length of varchar(2) and get a good idea of what kind of data that field holds, without even looking at the contents.

2) Reporting

When your data has no length constraint, you are relying on the developer to ensure that the column data is all similar in length. When reporting on that data, if the developer has failed to keep the column data consistent, the reports will be inconsistent and look funny.

3) SQL Server Data Storage

SQL Server stores data on 8k "pages" and from a performance standpoint it is ideal to be as efficient as possible and store as much data as possible on a page.

If your database is designed to store every string column as varchar(255), "bad" data could slip into one of those fields (for example a state name might slip into a StateCode field that is meant to be 2 characters long), and cause unnecessary and inefficient page and index splits.

Reagan Williams
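
The page-density point in item 3 above can be approximated numerically. This is a back-of-the-envelope sketch only: it reuses the 8060-byte figure from the other answers as the usable space per page and ignores all per-row overhead, so the real numbers will differ. The function names and row sizes are hypothetical.

```python
# Back-of-the-envelope estimate of rows per page and pages per table.
# Assumption: 8060 usable bytes per page, no per-row overhead.

USABLE_PAGE_BYTES = 8060  # approximation borrowed from the other answers

def rows_per_page(avg_row_bytes, usable=USABLE_PAGE_BYTES):
    """Whole rows of the given average size that fit on one page."""
    return usable // avg_row_bytes

def pages_needed(row_count, avg_row_bytes, usable=USABLE_PAGE_BYTES):
    """Pages required for row_count rows, rounding up."""
    per_page = rows_per_page(avg_row_bytes, usable)
    return -(-row_count // per_page)  # ceiling division

print(rows_per_page(1500))            # 5
print(rows_per_page(3000))            # 2
print(pages_needed(1_000_000, 1500))  # 200000
print(pages_needed(1_000_000, 3000))  # 500000
```

In this model, doubling the average row size from 1500 to 3000 bytes raises the page count for a million rows by a factor of 2.5, which is the kind of effect oversized values in wide columns can have.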