views:

2288

answers:

11

Update 2009.04.24

The main point of my question is not developer confusion and what to do about it.

The point is to understand when delimited values are the right solution.

I've seen delimited data used in commercial product databases (Ektron lol).

SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.

/end Update

The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.

Here's an example DB structure:

Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))


Category
------------
ID (PK)
Title

There is a many-to-many relationship between Document and Category.

In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.

To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.

With that model, to get all Documents for a Category, you would need something like the following:

select * from documents where categoryids like '%|' + @targetCategoryId + '|%'

My solution is to create an associative table as follows:

Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)

This is confusing to the developers. Is there some elegant alternate solution that I'm missing?

I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?

Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associated table?

Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or s it better to store the associated Document IDs in a single column?

Thanks.

+29  A: 

This is confusing to the developers.

Get better developers. That is the right approach.

Tom Ritter
Agreed 100%. A joiner table is the way to go.
Paul Tomblin
true, get rid of them and their stinking CSV data
SQLMenace
If your developers are dealing with a database system and encouraging you (directly or indirectly) to violate freaking FIRST normal form, then yeah, they should be on on the curb.
Adam Robinson
Tell us where you work, so we can all stay away.
RibaldEddie
Educate your developers on how associations work. Draw pictures. Write sample code. Administer beatings until they stop acting confused.
S.Lott
I second the beatings.
Fake Code Monkey Rashid
+13  A: 

It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.

Greg
+6  A: 

The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.

I would suggest having a surrogate primary key on the document_category table as well. And a Unique(documentid, categoryid) constraint if that makes sense to do so.

Why are the developers confused?

Arthur Thomas
+13  A: 

My solution is to create an associative table as follows: This is confusing to the developers

Really? this is database 101, if this is confusing to them then maybe they need to step away from their wizard generated code and learn some basic DB normalization.

What you propose is the right solution!!

SQLMenace
+8  A: 

The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).

Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.

Ilya Haykinson
-1 for proposing to violate 1NF.
Adam Robinson
+1 for strongly suggesting normalization while providing an alternative that is still better than what the OP's developers were suggesting.
Iceman
While normalization is my preference, I'll definitely investigate use of this XML column. How would a query against this column look?
frankadelic
There's a discussion at http://www.eggheadcafe.com/community/aspnet/13/80066/xml-column--use-value-to.aspx about doing thatHowever, the discussion at http://social.msdn.microsoft.com/Forums/en-US/sqlxml/thread/47c3582e-ee17-4a3f-92f5-9c92401441ef probably gives the better option. In brief, if your XML column is schema-typed (as opposed to untyped) you can query with something like SELECT t1.* FROM t1 JOIN t2 on t2.x.exist ('/foo/bar[.=sql:column("t1.i")]') = 1 This should use native XML indexing in SQL Server.
Ilya Haykinson
+1  A: 

Your current solution is the best practice for creating many-to-many dimensions, see what Kimball has to say about it.

Your developers need to take off the training wheels.

Eric
Educate the developers!
Walter Mitty
+18  A: 

Your suggestion IS the elegant, powerful, best practice solution.

Since I don't think the other answers said the following strongly enough, I'm going to do it.

If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data,

Then they ought to be IMMEDIATELY REMOVED FROM THEIR POSITIONS, if not right out of the company, perhaps to the help desk or as harmless data entry monkeys. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this stupid and can give them the years of training they are completely lacking.

Last, you should NEVER refer to them as "developers" again as this is an insult to those of us who actually are developers.

I hope this answer is very helpful to you.

Update

The main point of my question is not developer confusion and what to do about it.

Don't store delimited values in columns. Using XML to essentially do the same thing is almost always wrong, too. Storing XML in a column could make sense when it is treated as a "property bag" of sorts that is NOT regularly queried on by the database, but is sent whole to another consumer (perhaps a web server or an EDI recipient).

Emtucifor
+5  A: 

The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.

If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.

Jonathan Leffler
Since I wrote this answer, the U2 DBMS have been sold by IBM to Rocket Software (http://rocketsoftware.com/).
Jonathan Leffler
A: 

Store it as your separate table like you know you should.

If your developers are that ignorant, just make a view (depending on your RDBMS and requirements, possibly updateable) to show them the data as they want it. I wouldn't encourage this, as it just breeds bad design from their standpoint, but its an option.

rfusca
+2  A: 

This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.

I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.

John M Gant
Yes - the count of documents use case is a great example to demonstrate the limitations of the pipe-delimited approach.
frankadelic
+1  A: 

The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.

Most of them know that their work around is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.

It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.

So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.

Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.

We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.

Personally, I never authorize the use of comma delimited values in a relational database because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and deal with all the anomalies introduced because sometimes that data has embedded commas, too.

Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.

Karen Lopez