views:

72

answers:

4

It's commonly understood that in the relational model:

  1. Every relational operation should yield a relation.
  2. Relations, being sets, cannot contain duplicate rows.

Imagine a 'USERS' relation that contains the following data.

ID FIRST_NAME LAST_NAME
 1 Mark       Stone
 2 Jane       Stone
 3 Michael    Stone

If someone runs a query select LAST_NAME from USERS, a typical database will return:

LAST_NAME
Stone
Stone
Stone

Since this is not a relation - because it contains duplicate rows - what should an ideal RDBMS return?

+1  A: 

I'm not sure I see a problem with the returned values. There are three records that contain "Stone" for LAST_NAME. This would have been obvious if FIRST_NAME or ID had been included in the query, but it was not. Usually, the DISTINCT keyword is used to handle this and ensure that there will be no duplicates.

In fact, if my database started applying DISTINCT automatically (which it sounds like you think maybe it should), I'd be somewhat annoyed. Seeing duplicate rows when you don't expect to is often the needed break when debugging some weird data problem in a database.

FrustratedWithFormsDesigner
The other answer is the right one, in my opinion. In a true relation, there can be no duplicates.
Jason Swett
@Jason Swett: Well then, I guess I'm just not into "true" relationships. If I had a girlfriend, she'd probably agree with me :(
FrustratedWithFormsDesigner
+1  A: 

In a RDBMS a relational projection on the last name column alone would return only a set of tuples with distinct values of last name. There would be no duplicate tuples.

In SQL it is true that you would get duplicates unless you specified the DISTINCT keyword. That's because SQL is not a truly relational language - among other things because SQL tables and table expressions are not proper relations. A SQL DBMS is not a RDBMS.

dportas
Do you know any authoritative reference that confirms this?
Seun Osewa
That a projection is a relation with distinct tuples? E.F.Codd's classic 1970 paper "A Relational Model of Data for Large Shared Data Banks". The Alice Book (http://www.amazon.com/Foundations-Databases-Logical-Serge-Abiteboul/dp/0201537710). David Maier's Theory of Relational Databases. Those three contain pretty much everything you need to know about the basics of the relational model.
dportas
Database In Depth by Chris Date says the same stuff. Good book.
Jason Swett
I feel a bit disappointed that some information is lost as a result of the de-duplication. Namely, that there are 3 different users with that last_name.
Seun Osewa
By projecting only one attribute you are of course excluding other information. You could add an attribute with a count of each name if that's what you required. The point is that all such information is expressed as values within tuples within relations.
dportas
+1  A: 

"what should an ideal RDBMS return?"

As David indicated, it should return (in your example) one single row.

An SQL DBMS is only a relational one if it treats every SELECT as if SELECT DISTINCT were requested. (But there are a few tiny additional conditions to be met too.)

The reason this is so is that the "meaning" of that single row is as follows : "There exists some user such that he has a first_name, he has an ID, and his last_name is 'Stone'".

There is never any logical need to repeat that statement a second time. The authoritative reference that you asked for, is Ted Codd himself : "If something is true, then saying it twice won't make it any truer.".

Erwin Smout
But some information is lost - that there are 3 users with that last name.
Seun Osewa
+1  A: 

"But some information is lost - that there are 3 users with that last name."

If the count of users with that name is what you are interested in, then the query of your example is not the question you should be asking.

The query of your example will provide the answer to the question "What are all the last names such that there exists a user that has that last name ?".

If the question you want to ask is "how many users are there that are named 'Stone'", then the query you should submit is Select count(...) from users where last_name = 'Stone';

Projection always "looses" information. The information that is tied to the attributes that are projected away. I don't see how a known property of a useful relational operator can be explained as an argument against that operator.

Erwin Smout