views:

590

answers:

11

I have to maintain an application that has a lot of columns that are of the text data type, with multiple values inserted into them delimited with commas, slashes or sometimes even the pipe (|) character. I'm trying to figure out why on Earth you would ever want to do this.

For an example, an orders table has a column called details that contains information like:

2x #ABC-12345  Widget, Black: $24.99 /4x #ABC-12344 Widget, Blue: $23.50

where the / separates the line items; there is VBScript code which reads the value from a recordset and parses it out in a For loop for display using something like (and this is pretty much exactly how the code reads, variable names and all) arydtls = split(rstmp("details"), "/"). This method is repeated throughout the code for various tables.

It seems to me like it would be 100x better (not to mention easier to work with) to just have the details in a separate table and link back to it (funnily enough, for Orders it does do this, but the data doesn't always match the details text field because the OrderDetail table is updated in code; the details field is treated as read-only in the application).

Did my predecessor know something I didn't, or am I right to be saying "WTF?!!" when I look at this schema? It seems like it's insanely inefficient and difficult to maintain like this, and it makes running reports extra difficult because the data I need could be contained in text fields OR it might be in one of a dozen tables that have similar information and are used in different parts of the application.

+7  A: 

The two likeliest scenarios are:

  • Your predecessor was incompetent / didn't understand normalization
  • Your predecessor ran into some performance problems with the normalized structure and found this method was an improvement

Since normalization can often be very expensive when it comes to query operations, we can sometimes get performance gains by eliminating an expensive join and doing the manipulations on the application side against a single row.

There is no absolute rule for database design that says "storing delimited values in a single row is better for this scenario". It's all about testing against your specific data sets and your usage patterns and making improvements where necessary.

In my experience it's not very common for this pattern to be an improvement over normalization though... that's pretty atypical.

Edit: A third possibility is that having n-values per row was a change from the original schema, and instead of adding a new table your predecessor resized the column. That's not necessarily different from the "incompetent" option :) but there are sometimes political pressures involved in db schema changes...

Rex M
That's what I thought. I'll just put it this way: I could supply The Daily WTF with probably a month's worth of articles just based on this guy's code.
Wayne M
Let me add one more potential scenario:* The text data was imported from some legacy system (possibly excel/word tables) and was intended to be temporary storage until the data was parsed out into a normalized structure. The double storage of the Orders data seems to point to that.
JohnFx
@JohnFx - good point, though I would imagine the asker would be aware of the fact that this data was ported from a legacy system and his predecessor didn't design the DB.
Rex M
I actually saw the exact same thing on an application I inherited from someone. I saw it and assumed incompetence.
Tina Orooji
Assumption is the mother of all f***ups. If you assume incompetence you will never find the real reason. I'm not saying he isn't incompetent but you need to look beyond that to find his reason. It may be he knew no better, it may also be he knew a lot more than you. Somewhere inbetween is the answer
Robin Day
@Robin Day has it right, which is why I pointed out both as equal possibilities.
Rex M
The last time I saw this, I asked if the designer had worked on the Pick OS (now pretty much defunct). The designer had, and transferred Pick design ideas to Oracle (where they didn't work nearly as well).
David Thornley
@Robin Day Point taken. That is true. There may have been a reasonable explanation for this and although I couldn't ask the developer about it, I could (and probably should) have looked further. There were other problems that contributed, but in hindsight I should have at least considered it.
Tina Orooji
If I'm not mistaken, I believe the Microsoft aspnet_Profile Table for ASP.Net Memberships holds delimited data.
madcolor
A: 

WTF really. Never store such things in the DB.

Dev er dev
A: 

Your predecessor had maybe some other ideas and this was left unfinished???

I can tell you that this is very bad for performance

How would you create a query that will return who bought a blue widget? You will have to scan the whole table and then parse that info, if there was another table and this was normalized then this would be much better performance wise

SQLMenace
Judging from how he did it, I think he added that later on.. there *is* a Details table that (sometimes) contains the same information as the big details column, but is used in other areas of the application. Makes no sense to me why he did it that way!
Wayne M
A: 

I've seen a database in a certain piece of enterprise software that has this in tons of places. It's pretty terrible, both from a maintenance perspective and from a performance perspective. The reasons cited are generally:

  • it's "simpler" because it doesn't require joins
  • it's faster because it doesn't require joins
  • it doesn't clutter the database with lots of tables

Now, the first point is probably true, but it's only "simpler" until you want to query against it. Now you're screwed. So I'd say that's effectively refuted. The second point is again true, as long as you're not querying against it. As soon as you have to read in the entire table, parse the data, then do the filtering of rows in your app, you lose. The last one is always true, but who cares if the database is "cluttered"? That's what it's for! Decent RDBMSs will allow you to put your tables into multiple schemas anyways, which are somewhat like namespaces and help combat the clutter. A good naming convention helps too (but if you use hungarian warts, so help you $deity).

In short, it's a bad idea. I hope you're allowed to fix it, but most likely you're just going to have to deal with it on its original terms...

rmeador
A: 

In operating systems such as Universe, UniData data is stored in files delimited by something like

Char(254) = separates properties Char(253) = separates multi values in a property Char(252) = separates sub sub values and so on

Shocking isn't it :-) Whenever I talk to ex colleagues who still work with DataBasic and they ask what DB I use the first question they ask is "Does it handle multi values okay?"

In an RDBMS we would have an Order table and an OrderLine table. The PK on the OrderLine would most likely be something like OrderNumber,LineNumber.

In UniData etc what they would do is to have a property in Order called "Lines" which would hold a list of keys for the OrderLine file, the composite key usually being separated with an asterisk.

  • 1234*1
  • 1234*2
  • 1234*3
  • etc

Then when they load their order into memory from file they have a list of keys they need to load OrderLines from the OrderLine file. Note that these are files and not tables :-)

It looks to me like someone experienced with this old way of storing data has tried to use a relational database, not understood it at all, and then tried to make it work like UniData.

Sack them :-)

Peter Morris
+1  A: 

Quite simply he either had a reason or he didn't, without asking its impossible to know. If you make the assumption that he wasn't a total idea and thing of some possible reasons then maybe its one of the following.

If the data was for information only and "never going to change" as you hear so often then it may have been a quick win just to throw a display string straight to the field. After all, just replacing pipes with Tabs and slashes with BR's to put it on the screen is incredibly easy. If the code had to be written extremely quickly then this might've been the easy option.

A new feature since SQL 2005 is the XML data type. A major use of this is that you can store and index an unknown number of values against a particular record. You might care about the colour of one thing, the dimensions of another, the weight of something else. You might not be able to produce a definitive list of these things and a truly normalised generic method of storing this data may be too slow or overcomplicated for the system. This may have been a work around to try and get similar functionality.

The key thing here is, most things are done for a reason. You've looked at it the right way in trying to find out this reason. You might come accross it one day and think "Oh yeah!". Just looking at something from your own perspective can often lead to a can't see the wood for the trees scenario.

Robin Day
A: 

I can't say what your predecessor was thinking. As Rex M said, sometimes political pressures result in strange implementations.

A great many people who stuff a list of items into a single value in tables are trying to get around the restrictions of (old style) first normal form. The down side is that queries have to be done programatically in the app instead of using a simple citerion in a WHERE clause.

About 10 years ago, Oracle added the capability of putting a table inside a value. About the same time, Date redefined 1NF so that all relations are automatically in 1NF. This includes rleations that contain other relations. Without that feature, the simplest and most powerful design is to break the repeated item out into a separate value, with one row for each item.

(Example: a list of courses a student is enrolled in)

In a lot of cases, the root coause is the designers ignorance or stubbornness. Again, I don't know what restrictions your predecessor faced. Don't imitate him unless you have to.

Walter Mitty
A: 

Why would you do something like that?

Back about mumble decades ago, my wife worked on the Pick system, which included a database and a BASIC and such. The Pick database and language worked well with putting arrays into database fields (not sure if I should call them columns). So, there was an environment where this made perfect sense.

I don't know if Pick is still around, but I haven't heard of it in a long time. It's possible that this table was a Pick database translated (badly) into a SQL-based database, and it's possible that whoever wrote it was a former Pick developer who hadn't learned how to use a relational database well at the time.

Last time I ran into a database like that, I asked. Turned out it was designed by a former Pick developer.

I wouldn't call this design competent, unless this was truly intended as a write-only ignorable field, but it may well be that the designer wasn't stupid.

David Thornley
The tables were freshly created with SQL Server, and only done a few years ago, so I doubt it. I really think he just couldn't think of any better way to do it and just did it the quickest way he could think of.
Wayne M
Given that, you're probably right. It at least seemed like a more charitable explanation than anything else that came to mind.
David Thornley
A: 

One possible, somehow valid reason, could be that the data structure is not fixed, the detail attributes are very different with order instances.

It is not easy to work with dynamic attributes in a static structure like that imposed by a database. An XML structure for example is more suitable for such a scenario, but giving the inherent verbosity of the xml, the 'csv like' approach could have been a more attractive alternative.

Aleris
+2  A: 

Did my predecessor know something I didn't, or am I right to be saying "WTF?!!" when I look at this schema?

No, your predecessor didn't. Yes, you are right. See note at the end, however.

It seems like it's insanely inefficient and difficult to maintain like this, and it makes running reports extra difficult because the data I need could be contained in text fields OR it might be in one of a dozen tables that have similar information and are used in different parts of the application.

It is insanely inefficient. See note at the end, however.

A column should always be an indivisible attribute of the row. I see two copies of three (maybe four) attributes in this column you've shown:

2x #ABC-12345  Widget, Black: $24.99 /4x #ABC-12344 Widget, Blue: $23.50
  • quanity (2x/4x).
  • code (#ABC-12345/#ABC-12344).
  • description (Widget, Black:/Widget, Blue:) [may be description and color attributes].
  • price ($24.99/$23.50).

This would have been better designed as:

StockItems
    Code char(10) primary key
    Desc varchar(50)
Transaction
    TxnId something primary key
    : : :
TransactionPart
    TxnId something \
    TxnSeq int      / primary key
    Quantity integer
    Code char(10) foreign key StockItems(Code)
    Price float

NOTE:

It's possible that this was done to preserve historical information in the face of changing values elsewhere in the database. For example, if the description of a stock item changes or the item is deleted.

However, that's still not the right way to handle it. In that case, foreign key constraints would have stopped the item code from being deleted and processes should have been in place to prevent the description from being updated (such as versioning the stock item codes).

Of course, if you're never going to search on any of the items within that column, this is perfectly valid, though unwise in terms of possible future functionality to search on them.

Maybe the only thing ever searched on in this table is the customer code - then a free-format text field is adequate.

I still wouldn't do it that way but a YAGNI argument can be made that it would be better to change the DB schema in the future, if and when that search functionality needs to be added.

paxdiablo
A: 

Looks like a WTF to me. It's inconsistent with how other tables are implemented, and it's definitely not efficient. And when you look at the schema without knowing the data inside, it would be easy to misunderstand a column's sense.

However, there might be a reason as to why the past developer has done this, could you give us more info, like on the on the business logic? Thanks

mives