views:

371

answers:

2

I am having a discussion with someone about the following table that is used to link client-specific items:

Table LINK:

Client (int) 
Item1 (int) 
Item2 (int)

This is the disputed design. All three fields refer to other tables. The two Item fields refer to the same other table. These are not real field names, so don't bother with discussing naming conventions (the "1" and "2" are, however, really part of the field name). I am arguing that this design is bad on 1NF-violation grounds, while the other person is arguing that even though this seems distasteful, all other options are worse for our specific use-case.

Notes:

  • The vast majority of cases will only require linking two items with each other;
  • N:1 groups are however allowed; in such a case, the same Item1 is repeated on multiple lines with different Item2 values;
  • There are also a very small number of cases where some Item2 values (in an existing Item1-Item2 links) are themselves linked to other Items, and in these cases these values occur in the Item1 column, with the other linked value in the Item2 column; all linked items correspond to one group, and must be retrieved as such.

My claims:

  • This violates 1NF: Item1 and Item2 are foreign keys for the same table, and as such constitute a repeating group (the other party disagrees on the defn of repeating group);
  • For searches on Item, this means that two indexes are required instead of one, for example in a table that uses a GroupID field instead;
  • This makes queries looking for a specific Item in this table more complex, because a restriction clause must examine both Item1 and Item2 fields.
  • The retrieval for the case where chains of Item links occur will be more complex.

The other side claims:

  • The most viable alternative is a table with a single Item field, and an additional GroupID field;
  • The simpler, more common two-item link case now becomes more complex;
  • There could be concurrency issues in obtaining GroupID slots, and that needs to be managed
  • Managing the GroupID concurrency issues probably requires a second table with GroupIDs in a field with a uniqueness constraint
  • You now have to perform a join, at least some of the time, especially if an ORM is used. The join is less efficient than using a single table as in current design.

I would like to hear some opinions about this. I have read the other posts on SO about database design, and especially 1NF, but they don't deal as specifically with my case above as I would have liked. I have also come to understand, based on much research online, that so-called standards like 1NF can be defined in many different ways by different people. I have tried to be as clear as possible about both arguments, and not to bias one or the other.

EDIT 1:

  • Item1 and Item2 are (financial) transactions
  • The "1" and "2" are really part of the field name
+1  A: 

What are Item1 and Item2? Are they distinct entities? Then the design seems fine to me.

For example, you might want to fill a database with solutions to the traveling salesman problem. You have a table City(cityId, latitude, longitude), and a table Path(pathId, salesmanId). Now a path where the salesman visits n+1 cities would be represented by n entries in PathSegment(pathId, segmentId, fromCityId, toCityId). Here, although fromCityId and toCityId are foreign keys that reference the same table City, they describe different attributes of the PathSegment entity, hence this does not violate NF1.

Edit:

So you want to store trees, actually, only your trees are mostly just linked lists, and most of those are linked lists with just two nodes, right? And apparently your coworker wants to do this as an adjacency list, so a tree like

1-2-3
\-4

becomes

(1,2)
(2,3)
(1,4)

There's nothing wrong with that, but it's not the only way to store a tree in a database. For a good summary of alternatives, see here.

In your case, the advantage of using an adjacency list is that most of your trees have only two nodes, so most of them end up being one row in the table, keeping that simple. Also, questions about the immediate neighbours are easy. "What's the invoice for this payment?" becomes

select item1 from link where item2 = :paymentID

which is neat, too. There are drawbacks, though. The order of child nodes often matters, and the list doesn't help you here, so you have to store that either as a separate column or as something like timestamps in the tables your foreign keys are referring to). Also, reconstructing an entire branch becomes a recursive task, and not all database systems can do that. So if your application often has to retrieve a message-board-like overview of the invoice history, it might require some application-side logic that turns the list of adjacent nodes into a tree on the client and works on that. If that becomes too cumbersome, you might want to consider a nested sets representation, see here.

What's best for your problem? Depends on several things: size and shape of your trees (if they are really mostly short linked lists, adjacency list is good), frequency of inserts and updates (if frequent, adjacency list is good, because its inserts are cheap), frequency and complexity of queries (if frequent and complex, nested sets is good, because its selects are simple and fast). So for a message board, i'd go with nested sets (or even Tropashkos nested intervals for speed and extra coolness), but for a simple request-response (and sometimes some more response) table, i'd probably use an adjacency list.

wallenborn
Ok. In my example, Item1 and Item2 could have been switch. It would make no difference to the resulting group. Would that make a difference?
cjrh
Probably yes. I would consider a design change under a few circumstances. A few are: 1. if there is an attribute that depends only on both items. In the traveling salesman example this could be the distance between fromCity and toCity, 2. if the number of Items might change in the future. A PathSegment has always exaclty two endpoints and no change request can ever change that, but ymmv, and 3. if the Items are interchangeable (thats what you suggest), because then SELECTs tend to be hard: `select * from LINK where item1 in (4,7) and item2 in (4,7) and item1<>item2` is awkward.
wallenborn
Let's also discuss your (2): the (Item1, Item2) scheme above is being used to group, at least some of the time, >2 times. So there will never be more than 2 fields required, but at the same time, there are a variable number of items in a "group". A group, in this schema, is defined as a complete connected set. Perhaps something like a linked list. Would that also be problematic for you?
cjrh
Sorry, >2 *items*
cjrh
Response to your edit: thanks for the discussion, that was interesting. You haven't convinced me to change my position, but I appreciate the discussion.
cjrh
+1  A: 
Dewayne Christensen
Using your example, what about Relation1, Relation2 (instead of FatherID, MotherID)? Does that still satisfy 1NF?
cjrh
I take your point about same table referral, however. In the situation you have described, will it be implicit that FatherID and MotherID refer to a Person table? Several naming conventions prefer to indicate the table to which a foreign key refers?
cjrh
Normalization and naming are completely separate issues. You can name all your fields Field1, Field2, Field3, and still be in Nth-normal form.In the Person example, father and mother are semantically different things, therefore not a repeating group (as long as we stick to blood relatives). If we add siblings into the mix, however, then the story changes. Having fields Sibling1ID, Sibling2ID, ... constitutes a repeating group.Naming conventions are up to you and your team. If you'd rather use FatherPersonID and MotherPersonID that's your choice, but that doesn't affect normalization.
Dewayne Christensen
Yes, I asked the two questions separately for a reason. The important question was the first one. We could go further: would you enumerate aunts, uncles, cousins, nephews, and so on, as fields or as entries all in the same column, but with a "relation type" specifier?
cjrh
As soon as you strike something that you have multiples of, then you have a repeating group. So for aunts and uncles (and bears, oh my! :) ) you'd want a separate join table. Mother and father would depend on how they're being used. If it's a medical application and you only care about the birth parents, then two fields within the Person table would suffice; if it's an insurance application and you need to know "who's my daddy today" then you'd need to move mom and pop out to a separate table, probably with a date range.
Dewayne Christensen
It is often difficult to know in advance the kind of enhancements that will be required. It is a real pita to have to change schema for enhancement requests. IMO wallenborn's example of a "start" and "end" point are a good example of by-field enumeration, because they can never change, and every line segments requires both, and only those two at all times. I am a little more uncomfortable with roles played by people. This may, however, be paranoia.
cjrh