If I have the following tables:

Projects(projectID, CreatedByID)
Employees(empID, depID)
Departments(depID, OfficeID)
Offices(officeID)

("CreatedByID" is a foreign key for employees)

And I have a query that I need to run for almost every request of a web application that grabs all projects in an office. Is it bad practice to just add a redundant "OfficeID" column to Projects to eliminate the three joins? Or should I do the following?

SELECT *
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.empID
JOIN Departments D ON E.depID = D.depID
JOIN Offices O ON D.officeID = O.officeID
WHERE O.officeID = @SomeOfficeID

Until I notice performance problems?
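For concreteness, the denormalized version I have in mind would look something like this (just a sketch; the new column would simply mirror Departments.OfficeID):

-- Redundant column mirroring Departments.OfficeID (sketch)
ALTER TABLE Projects ADD OfficeID INT;

SELECT *
FROM Projects P
WHERE P.OfficeID = @SomeOfficeID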

In application programming I always follow the rule "Write with best practices first and optimize afterwards" but when it comes to database design and normalization like this I get concerned because DBAs are always warning about the cost of joins.

+5  A: 

The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).

I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.

That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.

DannySmurf
+4  A: 

DBAs should be concerned if your DB is not properly normalized to begin with. After you have carefully measured performance and determined that you have bottlenecks, you may start denormalizing, but I would be extremely cautious.

Otávio Décio
+17  A: 

Denormalization has the advantage of fast SELECTs on large queries.

Disadvantages are:

  • It takes more coding and time to ensure integrity (which is most important in your case)

  • It's slower on DML (INSERT/UPDATE/DELETE)

  • It takes more space

As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).

Optimizing for faster querying often implies duplicating data, be it denormalization, indices, extra tables, or whatever.

In the case of indices, the RDBMS maintains them for you, but in the case of denormalization, you'll need to code it yourself. What if a Department moves to another Office? You'll need to fix it in 3 tables instead of one.
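For instance, with a redundant Projects.OfficeID column, a single department move already turns into something like this (sketch; the IDs are made up):

-- Move department 42 to office 7 (hypothetical IDs)
UPDATE Departments
SET OfficeID = 7
WHERE depID = 42;

-- Every project created by that department's employees must be
-- corrected as well, or the redundant column silently goes stale
UPDATE Projects
SET OfficeID = 7
WHERE CreatedByID IN (SELECT empID FROM Employees WHERE depID = 42);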

As far as I can see from the names of your tables, there won't be millions of records there, so you'd better normalize your data; it will be simpler to manage.

Quassnoi
I think you meant to say "It's slower on DML (INSERT/UPDATE/DELETE)"
John
Sure I did, thanks.
Quassnoi
+11  A: 

Normalize till it hurts, then denormalize till it works

SQLMenace
+2  A: 

If you're using integers (or BIGINTs) as the IDs and they are the clustered primary keys, you should be fine.

Although it seems like it would always be faster to find an office from a project, since you are always looking up primary keys, the use of indexes on the foreign keys will make the difference minimal, as the indexes will cover the primary keys too.
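Something like these foreign-key indexes (the names are just illustrative) would cover the join path:

CREATE INDEX IX_Projects_CreatedByID ON Projects (CreatedByID);
CREATE INDEX IX_Employees_depID ON Employees (depID);
CREATE INDEX IX_Departments_OfficeID ON Departments (OfficeID);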

If you ever find a need later on to denormalise the data, you can create a cache table populated on a schedule or by a trigger.

John
The IDs don't necessarily need to be clustered for the best possible speed. Since these would all be seek lookups rather than scans it shouldn't make a difference when just traversing the FKs.
Tom H.
+7  A: 

Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).

Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.

Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.

Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.
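For example (just a sketch; the table name is made up), a pre-joined lookup table can be rebuilt on a schedule while the normalized tables remain the source of truth:

CREATE TABLE ProjectOfficeLookup (
    projectID INT PRIMARY KEY,
    officeID  INT NOT NULL
);

-- Rebuild (or refresh) from the normalized tables
INSERT INTO ProjectOfficeLookup (projectID, officeID)
SELECT P.projectID, D.OfficeID
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.empID
JOIN Departments D ON E.depID = D.depID;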

Peter Boughton
+2  A: 

Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.

In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.

What if the department occupies two offices?

What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?

SquareCog
+1  A: 

In the example given, indexes set up properly on the tables should allow the joins to occur extremely fast, and they will scale well to hundreds of thousands of rows. This is usually the approach I take to get around the issue.

There are times, though, when the data is written once and then selected for the rest of its life, and in those cases it really doesn't make sense to do a dozen joins each time.

DavGarcia
+2  A: 

Keep that schema in Third Normal Form and let your DBA complain about the cost of joins.

Roberto Russo
+2  A: 

I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.

le dorfier
+3  A: 

You shouldn't look at denormalizing before you've tried everything else.

Is the performance of this really an issue? Does your database have any features you can use to speed things up without compromising integrity? Can you increase your performance by caching?
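For example, if your RDBMS supports materialized or indexed views, the join can be cached inside the database itself and refreshed automatically or on demand, depending on the vendor (sketch; exact syntax varies, and SQL Server uses indexed views instead):

CREATE MATERIALIZED VIEW ProjectOffices AS
SELECT P.projectID, D.OfficeID AS officeID
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.empID
JOIN Departments D ON E.depID = D.depID;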

John Nilsson
+1  A: 

Don't denormalize.

Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.

One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possibility of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.

There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.

One set of design principles that leads to a radically different design from normalization is star schema design. Star schema is very fast for queries. Even large scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers from update anomalies. You have to program around these anomalies when you keep the database up to date. You will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.
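As an illustration only (all table and column names below are invented), a star for this domain might fold the department and office attributes into an employee dimension and keep the fact table narrow:

-- Hypothetical star schema sketch: the dimension carries denormalized
-- descriptive attributes, the fact table carries keys and measures
CREATE TABLE DimEmployee (
    empKey   INT PRIMARY KEY,
    empID    INT,
    depID    INT,      -- department attribute folded into the dimension
    officeID INT       -- office attribute folded into the dimension
);

CREATE TABLE FactProject (
    projectID      INT,
    empKey         INT REFERENCES DimEmployee (empKey),
    createdDateKey INT  -- would reference a date dimension in a fuller design
);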

Using data stored in a star schema is dramatically easy. It's so easy that using some kind of OLAP and reporting engine, you can get all the information needed without writing any code, and without sacrificing performance too much.

It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.

It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and omissions in data analysis may result in unfortunate choices in dimensions and granularity. This will make ETL almost impossible to build, and/or make the information carrying capacity of the star inadequate for the emerging needs.

Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.

This response expands on your original question, but I think it's relevant for the would-be database designer.

Walter Mitty