I'm working on data migration from an old IBM UniVerse-based system to a new enterprise-level data/information management system, and learning about database design in the process.

I took a look at the back-end database structure of the new system (it's an MS SQL database with about 100 tables) and found some things pretty odd. But I don't know if my inexperience is the reason I think that and this is just standard practice, or if these oddities are really just bad database/application design.

For example:

  • Some date fields are varchar(20)
  • Fields that store measurements are varchar(50), rather than something like a decimal and an enum to store the units of measurement
  • ISBN 10 & 13 number fields are varchar(50)
  • Some look-up ID foreign keys are varchar(100), even though the actual look-up table primary key is an int
  • Some fields are varchar(0)
  • Additional separate fields for storing month & year, each of which is varchar(250) - I don't know what kind of design decision would require a maximum of 250 characters for the year, unless they really went overkill on their Y2K compliance, or decided to store datetimes as seconds since the beginning of the universe

And plenty of others. The DB looks to be more than half varchar fields.

I should also mention that all the varchar fields in the DB are actually nvarchar - so it's all Unicode, even the fields that only store numbers.

Is there any legitimate argument that using so many varchar fields could be the best option, in some circumstances? (flexibility... maybe...?)

A: 

Nope. I'd change it if it were mine. Do you know who made those decisions? If they're still around you can ask them.

Beth
Sadly this new system is from another company - it wasn't developed internally, and I can't talk to the developers about it.
Kyle Lowry
+2  A: 

Some date fields are varchar(20)

This is one of the things that will always get you in trouble down the road: now you can have invalid dates stored, and you can't do normal date arithmetic.
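
A minimal sketch of the failure mode, with a hypothetical table (the names are illustrative, not from the actual system):

    -- Nothing stops garbage from going into a varchar(20) "date"
    CREATE TABLE Orders (OrderDate varchar(20));
    INSERT INTO Orders VALUES ('2009-02-30');   -- not a real date, accepted anyway
    INSERT INTO Orders VALUES ('next Tuesday'); -- also accepted

    -- Date arithmetic now needs a conversion that can blow up at runtime
    SELECT DATEDIFF(day, CONVERT(datetime, OrderDate), GETDATE())
    FROM Orders; -- fails as soon as it hits a bad row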

Some look-up ID foreign keys are varchar(100), even though the actual look-up table primary key is an int

This is bad because you will get implicit conversions when you join, which makes the join slower and can prevent indexes from being used.
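
For instance (hypothetical tables; this assumes the varchar FK column happens to hold digits):

    CREATE TABLE Status (StatusID int PRIMARY KEY, Name varchar(50));
    CREATE TABLE Widget (WidgetID int, StatusID varchar(100));

    -- int has higher data type precedence, so SQL Server implicitly converts
    -- Widget.StatusID on every row, which defeats any index on that column
    SELECT w.WidgetID, s.Name
    FROM Widget w
    JOIN Status s ON w.StatusID = s.StatusID;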

Store decimals as decimals... sooner or later you will get bad data in there, and then it will be a classic case of GIGO (Garbage In, Garbage Out).
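
A sketch of how that plays out with the measurement fields (hypothetical names again):

    CREATE TABLE Sample (Measurement varchar(50));
    INSERT INTO Sample VALUES ('12.5 kg');    -- units mixed into the value
    INSERT INTO Sample VALUES ('approx. 12'); -- not a number at all

    -- Aggregation is now impossible; a decimal column plus a unit lookup FK
    -- would have rejected both bad rows at insert time
    SELECT SUM(CONVERT(decimal(10,2), Measurement)) FROM Sample; -- fails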

Also, using nvarchar to store numbers is insane: you just doubled the storage needed to store those numbers, which means fewer rows fit per page, and you will need more I/O to bring back the same number of rows than if you had used regular varchars or integers.
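
You can see the overhead directly; a quick sketch:

    SELECT DATALENGTH(CAST('1234567890' AS varchar(50)))   AS varchar_bytes,  -- 10
           DATALENGTH(CAST(N'1234567890' AS nvarchar(50))) AS nvarchar_bytes, -- 20
           DATALENGTH(CAST(1234567890 AS int))             AS int_bytes;      --  4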

SQLMenace
Thanks - the unicode issue was something that concerned me too. With all the data we'll be putting into the new system, I'm afraid that it will run very slowly, among other things. The look-up table alone is massive. Tons and tons of stuff in there that probably could have been built into the DB design itself.
Kyle Lowry
+2  A: 

It does seem odd, but it really depends on how the data is used. There could be very good reasons for the use of varchar. If there is no need to use the fields in criteria or perform calculations, using varchar would give the user a lot more freedom to do what they want.

For example, in real-estate, it seems like the price of a home should be numeric. However, many agents want to display phrases like "call for pricing", "in the low 300s", etc (though we keep a separate numeric price field for searching).
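
A sketch of that two-field pattern (the column names here are illustrative):

    CREATE TABLE Listing (
        ListingID    int PRIMARY KEY,
        Price        decimal(12,2) NULL, -- used for searching and sorting
        PriceDisplay varchar(50)   NULL  -- 'call for pricing', 'in the low 300s', ...
    );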

I would suggest looking at how the fields are used to determine if they should be varchar or not. If you see a lot of conversions from varchar to the type it should be, then varchar is probably not the right choice.
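
One way to audit a suspect column (hypothetical names; ISDATE is only a rough check, and it is sensitive to the server's language/date-format settings):

    -- Find the rows in a varchar(20) "date" column that won't convert cleanly
    SELECT SomeDate
    FROM SomeTable
    WHERE ISDATE(SomeDate) = 0;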

Brian
That's what I was thinking. This is kind of a generic system in some ways, and I can see that some fields should be varchar, but many others don't make sense to me. For report generation (which is, in theory, a major benefit of the new system), all these varchar fields are going to make it difficult/impossible to search, sort, and group data for reports.
Kyle Lowry
Depending on how carefully the data is entered, it could make reports inaccurate. If people are basing business decisions on it, this is worse than no report at all. People tend to trust computers more than their own personal judgement. If the report is telling them something wrong (e.g., that they've sold 100 units of something in a given time frame when they actually sold 1,000), they will probably make poor decisions that could affect the stability of the company.
Brian
+1  A: 

Some of these are clearly problems, especially the "dates as text" and "foreign keys that don't match the datatype of their related key".

The "ISBN 10 & 13 number fields as varchar(50)" isn't quite as clear-cut. Sure, it would work to store it as a BIGINT, but there are some good arguments for using a CHAR(10) or CHAR(13) instead: (even though it uses slightly more storage. Varchar(50) is clearly overkill)

  1. Will you ever need to do mathematical operations with this number? (no)
  2. Will you frequently "pretty-format" it? (00-0000-00-0 or something. It's easier to perform formatting operations on strings.)
  3. Will you ever have to do a LIKE comparison? WHERE convert(varchar(13),ISBN) LIKE '%123%' is pretty ugly.

So depending on exactly how it will be used, I wouldn't have an issue using the CHAR instead. Actually, you could argue that VARCHAR(13) would make sense if a significant number of rows will have no ISBN (less storage).
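
A minimal sketch of what that column might look like (names are illustrative):

    CREATE TABLE Book (
        BookID int PRIMARY KEY,
        ISBN13 char(13) NULL -- digits only, no dashes
    );

    -- LIKE works directly on the string form, no convert() gymnastics
    SELECT BookID FROM Book WHERE ISBN13 LIKE '%123%';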

BradC
I never thought about the ISBN issue that way. I, unfortunately, don't have access to the back-end of the user interface - so I can only imagine what the queries are that get executed in the course of normal operation.
Kyle Lowry
A string ISBN column invites the "are there dashes or formatting stored in the column?" question, which leads to problems and inconsistencies. The extra-long 50 gives more rope to hang yourself with. The application might be coded one way, but 3 years from now a new developer codes something a different way, and it "fits into the column, so it must be right". A varchar(50) invites the CSV multiple-values syndrome. When defining a column you must pick the strictest data type possible; it is your last chance to enforce business logic and any form of data standardization.
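
A sketch of that last-chance enforcement, reusing the hypothetical Book table from the answer above (the CHECK name is made up, and it only enforces "digits only", not the actual check digit):

    -- Reject dashes, CSV lists, and anything else that isn't pure digits
    ALTER TABLE Book ADD CONSTRAINT CK_Book_ISBN13
        CHECK (ISBN13 NOT LIKE '%[^0-9]%');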
KM
I'm thinking data standardization wasn't a big priority for the developers of this system. Nor was normalization in general. *sigh*
Kyle Lowry