views: 234
answers: 11

I have two tables, like these:

Table People:
VARCHAR Name
INTEGER Age

Table Message
VARCHAR Message
VARCHAR Name

There are hundreds of inserts and deletes happening on the Message table, with queries like this:

insert into Message VALUES ('Hello there', 'John');
delete from Message where name = 'John';

My question is: is it worthwhile to add an ID field to People and reference the user by id in Message? Would the following queries be much faster?

FIRST - select id from User where name = 'John'
THEN - delete from Message where id = $id

I thought at first that inserts and deletes by id would be faster than by string comparison. But how much faster? Will the speed increase offset the performance hit of needing an extra query to the User table to find the id?

+1  A: 

If most of the names are short (not 15 to 20 characters long), and the table is indexed properly, then the performance gain you are going to see from an id field is going to be negligible.

Kevin
+2  A: 

Is it faster? Only profiling will tell, however . . .

It IS better practice to put an id column on Person, and a foreign key constraint from Message to Person on id (assuming all messages can only go to people in the Person table).

You can still delete a message in one statement

delete from Message where id IN (select id from Person where Name = 'John')

and the database will optimise this so it is far faster than issuing a separate select and then a delete.

You can specify cascading deletes on your foreign key constraint, so that when you delete a Person, all messages to the person are automatically deleted too.
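Both points (the one-statement delete and the cascading delete) can be sketched with SQLite from Python; the `person_id` column name and VARCHAR sizes are assumptions for illustration, and the exact constraint syntax varies slightly between engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.execute("""CREATE TABLE Message (
    message VARCHAR(200),
    person_id INTEGER REFERENCES Person(id) ON DELETE CASCADE)""")

conn.execute("INSERT INTO Person (id, name) VALUES (1, 'John')")
conn.execute("INSERT INTO Message VALUES ('Hello there', 1)")

# One-statement delete by name, no separate SELECT needed:
conn.execute(
    "DELETE FROM Message WHERE person_id IN "
    "(SELECT id FROM Person WHERE name = 'John')")

# Cascading delete: removing the person removes their messages too.
conn.execute("INSERT INTO Message VALUES ('Hi again', 1)")
conn.execute("DELETE FROM Person WHERE name = 'John'")
remaining = conn.execute("SELECT COUNT(*) FROM Message").fetchone()[0]
print(remaining)  # 0
```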

See here for more on Foreign Keys

Binary Worrier
I think "id = (...)" should be "id IN (...)", although it might depend on the flavor of SQL in use.
Carl Manaster
@Carl: Good point
Binary Worrier
+4  A: 

As you said, the extra query will make it SLIGHTLY slower (of course it depends on the name's length, the database type, etc.)

But what happens when a user changes his name, or when you want to delete a user? That sort of design will give you lots of pain. It's better to normalize, regardless of this tiny performance issue.

tekBlues
+1 for "speed isn't everything" but a sound design really pays off in the long run!
marc_s
+1  A: 

You shouldn't need to do an extra query. You can do something like this:

DELETE Message FROM Message 
INNER JOIN User 
  ON Message.id = User.id 
WHERE User.name = 'John'
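The multi-table DELETE syntax above is engine-specific; in standard SQL the same delete can be written with a correlated `EXISTS`, which also works on engines that lack join deletes. A runnable sketch using SQLite from Python, with table and column names as in this answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Message (message TEXT, id INTEGER)")  # id references User.id
conn.execute("INSERT INTO User VALUES (1, 'John'), (2, 'Jane')")
conn.executemany("INSERT INTO Message VALUES (?, ?)",
                 [('Hello there', 1), ('Hi', 2), ('Bye', 1)])

# Standard-SQL equivalent of the join delete above:
conn.execute("""DELETE FROM Message WHERE EXISTS (
    SELECT 1 FROM User
    WHERE User.id = Message.id AND User.name = 'John')""")

left = [r[0] for r in conn.execute("SELECT message FROM Message")]
print(left)  # ['Hi']
```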
Bernard Chen
if he's using SQL Server, he'll need to break out Microsoft's spiffy double FROM syntax.
David B
Will this join have a huge impact on performance?
erotsppa
+3  A: 

A person's name is never a good primary key, as names are NOT unique, and they change over time. Far better to use a surrogate key (and yes, a join on an int is often faster, and in many databases you can join in delete statements rather than using the slower subquery), especially since names tend to be longer than a few characters.

HLGEM
You have a point, but: names can be unique, if the database enforces the uniqueness - e.g. "name" here might just be their login name, not necessarily their birth name. The question does not elaborate on the schema design, probably because he's more concerned about performance of lookup by numbers vs. strings.
Jeffrey Kemp
+1  A: 

In my experience, the user table of a website backend is one of those tables that pretty much stays in memory 100% of the time. It is pretty central to any activity so it never falls out of the page buffers. So I would (and have) definitely go the route of using the userId for all references like this.

A: 

To your exact question: with such a small schema, to dump the contents of the original Messages table, the denormalized version will be faster. The query plan will be smaller and easier to optimize, and there will be no join overhead.

In general, it's much, much more complicated.

Whether it's the right thing to do is another question. For that, start with a normalized design, but be willing and prepared to denormalize if there's a compelling reason to do so. There are sometimes legitimate reasons for denormalization, though usually the gains of normalized data offset any performance loss.

Normalized data is easier to maintain and is generally more flexible. For flexibility, having a numeric pkey lets you have multiple people with the same name. You can add more fields to People easily, and it's easier to run a report of all the people in the system without scanning all of Messages.

But performance may be a factor. Given the data in the two tables, the database has several options on how to join. It may use either People or Messages as the base table, and how the join is done will affect things (nested loops, hash joins, sort/merge, etc).

But on top of that, normalized can actually be faster. What if your schema is more complicated than you describe? Say your People table has 50 fields of HR-related stuff and your Messages table has only a single 20-character message field. If you have two people but 100k messages, normalized will actually be faster, because I/O is the biggest limiting factor for databases. If you were to dump all the data in one query, the normalized version fetches those 50 People fields only once, and your Messages table is densely packed with data. In the denormalized version, each row of Messages contains 51 fields, and you drastically increase the number of I/Os needed to get the same result.
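The I/O argument can be made concrete with a back-of-the-envelope page count; every number below is an illustrative assumption, not a measurement:

```python
# Matt's scenario: 2 people, 100,000 messages, a "person" record of
# ~50 HR fields vs. a short message field.
PAGE_SIZE = 8192          # a typical database page size in bytes
person_row = 50 * 40      # ~50 fields at ~40 bytes each = 2000 bytes
message_row = 20 + 4      # 20-char message + 4-byte person id

# Denormalized: every message row carries all 50 person fields inline.
denorm_row = (message_row - 4) + person_row

messages = 100_000
norm_pages = 2 * person_row // PAGE_SIZE + messages * message_row // PAGE_SIZE + 1
denorm_pages = messages * denorm_row // PAGE_SIZE + 1
print(norm_pages, denorm_pages)  # the denormalized dump reads ~80x more pages
```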

Matt
A: 

You shouldn't need to worry about optimization until you've got a good design.

Right now, I think it's hard to say if this is an unrealistic toy problem, but typically, you are going to design with a primary key and foreign key relationship in mind, and this key is very unlikely to be a varchar.

It might be a GUID or it might be an int, but either way, you're going to have at least a non-clustered index on it, perhaps a clustered index on it, and you'll have a non-clustered index on important data like the username, so ultimately performance of your entire system (and not just this one delete) is going to depend on a solid normalized design and good indexing strategy for these tables.

Cade Roux
-1 most of my lookup tables have PKs on a VARCHAR column which are referenced via FKs.
onedaywhen
The majority of my lookup tables are int surrogates or char columns, and yes, some lookup tables will have varchar. I stand by my contention that for most application entities (i.e. no natural keys), you are extremely unlikely to have a varchar primary key.
Cade Roux
"no natural keys" is an assumption too far.
onedaywhen
+3  A: 

Your design is already normalized, assuming you have a unique constraint on People.Name, and there is a referential integrity constraint between Message.Name and People.Name.

It's not so much a problem of normalization - IF you want to allow people to change their names, you have a performance and scalability issue (needing to update all associated rows in the Message table).

If people never change their names in your system, then it's not a problem. In this case, Name is pretty much as good as an ID - although some DBMS's might perform better with an indexed number rather than an indexed string (?).

The performance of the delete is another matter - I'd say if you have a unique name already, then deleting by name will be quicker than having to do a lookup (or join) to delete by ID - but again you'll want to do your own benchmarking.

Jeffrey Kemp
A: 

It is all about I/O and maintainability. If your varchar contains fewer than 4 bytes, then your query will be faster if you use the varchar rather than the integer. But it is not a big improvement, and you lose a lot of performance if you ever need to change the name! Every row in the Message table needs to be updated (e.g. deleted and re-inserted).

If you use an integer, only 4 bytes are stored for the reference to the user table. And if you use a covering index on ID and name in the user table, then the extra lookup is only slight overhead. The user table will probably stay in the cache, depending on how many rows you expect in it and how much memory you have; in that case you will have logical reads instead of much slower physical reads.
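The covering-index idea can be sketched in SQLite (called from Python here): because `id` is the rowid and every SQLite index stores the rowid, an index on `name` alone already covers a name-to-id lookup, so the engine never touches the table rows. (On SQL Server you would instead build the index with an `INCLUDE` column.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX ix_user_name ON User (name)")
conn.execute("INSERT INTO User VALUES (1, 'John')")

# The plan shows the lookup is answered from the index alone:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM User WHERE name = 'John'"
).fetchall()
print(plan)  # SQLite reports a COVERING INDEX search
```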

Hakan Winther
A: 

A side note: don't forget to put indexes on the columns that you are joining your tables on, if they aren't there already.

Jon