views:

178

answers:

9

My friend and I are building a website and having a major disagreement. The core of the site is a database of comments about 'people.' Basically people can enter comment and they can enter the person the comment is about. Then viewers can search the database for words that are in the comment or parts of the person name. It is completely user generated. For example, if someone wants to post a comment on a mispelled version of a person's name, they can, and that's OK. So there may be multiple spellings of different people listed as several different entries (some with middle name, some with nickname, some mispelled, etc.), but this is all OK. We don't care if people make comments about random people or imaginary people.

Anyway, the issue is about how we are structuring the database. Right now it is just one table with the comment ID as the primary key, and then there is a field for the 'person' the comment is about:

comment ID - comment - person

1 - "he is weird" - John Smith

2 - "smelly girl" - Jenny

3 - "gay" - John Smith

4 - "owes me $20" - Jennyyyyyyyyy

Everything is working fine. Using the database, I am able to create pages that list all the 'comments' for a particular 'person.' However, he is obsessed that the database isn't normalized. I read up on normalization and learned that he was wrong. The table IS currently normalized, because the comment ID is unique and dictates the 'comment' and the 'person.' Now he is insistant that 'person' should have it's OWN table because it is a 'thing.' I don't think it is necessary, because even though 'person' really is the bigger container (one 'person' can have many 'comments' about them), the database seems to operate just fine with 'person' being an attribute of the comment ID. I use various PHP calls for different SQL selections to make it magically appear more sophisticated on the output and the different way the user can search and see results, but in reality, the set-up is quite simple. I am now letting users rank comments with thumbs up and thumbs down, and I keep a 'score' as another field on the same table.

I feel that there is currently no need to have a separate table for just unique 'person' entries because the 'persons' don't have their own 'score' or any of their own attributes. Only the comments do. My friend is so insistant that it is necessary for efficiency. Finally I said, "OK, if you want me to create a separate table and let 'person' be it's own field, then what would be the second field? Because if a table has just a single column, it seems pointless. I agree that we may later create a need to give 'person' it's own table, but we can deal with that then." He then said that strings can't be primary keys, and that we would convert the 'persons' in the current table to numbers, and the numbers would be the primary key in the new 'person' table. To me this seems unnecessary and it would make the current table harder to read. He also thinks it will be impossible to create the second table later, and that we need to anticipate now that we might need it for something later.

Who is right?

+5  A: 

In my opinion your friend is right.

Person should live in a different table and you should try to normalize. Don't overdo-it, though.

In the long run you may want to do more things with your site, say you want to attach multiple files to a person (ie. pictures) you'll be very thankfull then for the normalization.

Frankie
I agree with Frankie/your friend. Making this kind of change later, while not impossible, is awkward and prone to errors.
Jaydee
Can anybody explain how creating a surrogate key for an attribute that doesn't appear on the left hand side of any functional dependancy normalize a data base? As the OP said, Person determines nothing (and never will). Would you have offered the same advice for an attribute named 'Stuff'? There may be a normalization problem here but it dosen't involve Person.
NealB
@NealB my experience as a teacher and the way the question is asked makes me believe the OP is biased. The simple fact that the field is called person and not text is, IMO, relevant.
Frankie
It is hard to say if the OP has a bias or has just given unfortunate names to some of the attributes in his model. When I see something called Person I natrually expect there to be something more to it than a few random letters, but given his description of the application, Person appears to be just that a tag that determines no other facts. Should that change the argument for creating a surrogate key for it and seperate relation for Person takes on a whole new life.
NealB
+2  A: 

I would vote for your friend. I like to normalize and plan for the future and even if you never need it, this normalization is so easy to do it literally takes no time. You can create a view that you query in order to make your SQL cleaner and eliminate the need for you to join the tables yourself.

RC
+2  A: 

If you have already reached all of your capabilities and have no plans for expansion of capabilities I think you leave it as it is.

If you plan to add more, namely allowing people to have accounts, or anything really, I think it might be smart to separate your data into Person, Comments tables. Its not hard and makes expanding your functionality easier.

hvgotcodes
A: 

Here's the deal. Whenever you create something, you want to make sure that it has room to grow. You want to try to anticipate future projects and future advancements for your program. In this scenario, you're right in saying that there is no need currently to add a persons table that just holds 1 field (not counting the ID, assuming you have an int ID field and a person name). However, in the future, you may want to have other attributes for such people, like first name, last name, email address, date added, etc.

While over-normalizing is certainly harmful, I personally would create another, larger table to hold the person with additional fields so that I can easily add new features in the future.

Aaron
is it really that hard at a later point to just pull all the 'unique' entries from 'person' and make them a column in a new table? and is there a problem with using a string as a primary key?
jason yanofski
Using a string as your PK just makes things more difficult. Plenty DBMSs allow you to use auto-incrementing numeric IDs, so you don't really need to be concerned with the possibility of a mistake happening. It just makes more sense to have a numeric ID as opposed to a string.
Aaron
A: 

Well, there are two schools of thought. One says, create your data model in the most normalized way possible, then de-normalize if you need more efficiency. The other is basically "do the minimum work necessary for the job, then change it as your requirements change". Also known as YAGNI (You aren't going to need it).

It all depends on where you see this going. If this is all it will be, then your approach is probably fine. If you intend to improve it with new features over time, then your friend is right.

Mystere Man
+2  A: 

You're right.

Person may be a thing in general, but not in your model. If you were going to hassle people into properly identifying the person they're talking about, a Person table would be necessary. For example, if the comments were only about persons already registered in the database.

But here it looks like you have an unstructured data, without identity; and that nothing/nobody is interested in making sure whether "jenny" and "jennyyy" are in fact the same person, not to mentionned "jenny doe", and "my cousin"...

pascal
+3  A: 

Creating a new table for person and using the key of that table in place of the person attribute has nothing to do with normalization. It may be a good idea for other reasons but doing so does not make the database "more normalized" than not doing it. So you are right: as far as normalization is concerned, creating another table is unnecessary.

dportas
+1  A: 

Normalization is all about functional dependencies (FD's). You need to identify all of the FD's that exist among the attributes of your data model before it can be fully normalized.

Lets review what you have:

  • Any given instance of a CommentId functionally determines the Person (FD: CommentId -> Person)
  • Any given instance of a CommentId functionally determines the Comment (FD: CommentId -> Comment)
  • Any given instance of a CommentId functionally determines the UserId (FD: CommentId -> UserId)
  • Any given instance of a CommentId functionally determines the Score (FD: CommentId -> Score)

Everything here is a dependant attribute on CommentId and CommentId alone. This might lead you to the belief that a relation (table) containing all of, or a subset of, the above attributes must be normalized.

First thing to ask yourself is why did you create the CommentId attribute anyway? Strictly speaking, this is a manufactured attribute - it does not relate to anything 'real'. CommentId is commonly referred to as a surrogate key. A surrogate key is just a made up value that stands in for a unique value set corresponding to some other group of attributes. So what group of attributes is CommentId a surrogate for? We can figure that out by asking the following questions and adding new FD's to the model:

  • 1) Does a Comment have to be unique? If so the FD: Comment -> CommentId must be true.
  • 2) Can the same Comment be made multiple times as long as it is about a different Person? If so, then FD: Person + Comment -> CommentId must be true and the FD in 1 above is false.
  • 3) Can the same Comment be made multiple times about the same Person provided it was made by different UserId's? If so, the FDs in 1 and 2 cannot be true but FD: Person + Comment + UserId -> CommentId may be true.
  • 4) Can the same Comment be made multiple times about the same Person by the same UserId but have different Scores? This implies FD: Person + Comment + UserId' + Score -> CommentId is true and the others are false.

Exactly one of the above 4 FD's above must be true. Whichever it is affects how your data model is normalized.

Suppose FD: Person + Comment + UserId -> CommentId turns out to be true. The logical consequences are that:

  • Person + Comment + UserId and CommentId serve as equivalent keys with respect to Score
  • Score should be put in a relation with one but not both of its keys (to avoid transitive dependencies). The obvious choice would be CommentId since it was specifically created as a surrogate.
  • A relation comprised of: CommentId, Person, Comment, UserId is needed to tie the Key to its surrogate.

From a theoretical point of view, the surrogate key CommentId is not required to make your data model or database work. However, its presence may affect how relations are constructed.

Creation of surrogate keys is a practical issue of some importance. Consider what might happen if you choose to not use a surrogate key but the full attribute set Person + Comment + UserId in its place, especially if it was required on multiple tables as a foreign or primary key:

  • Comment might add a lot of space overhead to your database because it is repeated in multiple tables. It is probably more than a couple of characters long.
  • What happens if someone chooses to edit a Comment? That change needs to be propagated to all tables where Comment is part of a key. Not a pretty sight!
  • Indexing long complex keys can take a lot of space and/or make for slow update performance

The value assigned to a surrogate key never changes, no matter what you do to the values associated to the attributes that it determines. Updating the dependant attributes is now limited to the one table defining the surrogate key. This is of huge practical significance.

Now back to whether you should be creating a surrogate for Person. Does Person live on the left hand side of many, or any, FDs? If it does, its value will propogate through your database and there is a case for creating a surrogate for it. Whether Person is a text or numeric attribute is irrelevant to the choice of creating a surrogate key.

Based on what you have said, there is at best a weak argument to create a surrogate for Person. This argument is based on the suspicion that its value may at some point become a key or part of a key at some point in the future.

NealB
+1  A: 

If you never intend to associate the person column with a user or anything else and data apparently needs no consistency or data integrity checks, just why is this in a relational database at all? Wouldn't this be a use case for a nosql database? Or am I missing something?

HLGEM