Please take into consideration this is a MySQL Question for Web Development.

Currently I'm designing the database structure for a user authentication system, and I came across a question that I can't figure out myself:

Is it better to have duplicated data instead of making more queries?

Here's a little background, currently my users table looks something like this (pseudo-code):

id mediumint
username varchar(15)
password varchar(100)
email varchar(80)
status tinyint(1) <- is the user banned?
language varchar(100)
private_message_counter mediumint
notify_email tinyint(1)
Additional columns
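In MySQL DDL, the table sketched above might look roughly like this (a sketch only; the column types come from the pseudo-code, while the constraints, defaults, keys, and InnoDB engine are my assumptions):

```sql
CREATE TABLE users (
    id                      MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    username                VARCHAR(15)  NOT NULL,
    password                VARCHAR(100) NOT NULL,           -- password hash
    email                   VARCHAR(80)  NOT NULL,
    status                  TINYINT(1)   NOT NULL DEFAULT 0, -- is the user banned?
    language                VARCHAR(100) NOT NULL DEFAULT 'en',
    private_message_counter MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
    notify_email            TINYINT(1)   NOT NULL DEFAULT 1,
    PRIMARY KEY (id),
    UNIQUE KEY idx_username (username)
) ENGINE=InnoDB;
```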

I'm trying to put all the "most used" columns into the users table to avoid extra queries. For example:

With indicators on the users table:

- Is the user logged on? (query sessions)
- Get user data (query users)
- Get user permissions (query permissions)

Without indicators:

- Is the user logged on? (query sessions)
- Is the user banned? (query bans)
- Get user data (query users)
- Get user permissions (query permissions)
- Get private message information (query private_messages)
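Note that the "without indicators" path doesn't have to be five separate round trips; most of the lookups can be folded into one query with joins. A hedged sketch, where every table and column name beyond users (sessions.session_id, bans.user_id, private_messages.recipient_id, is_read) is an assumption:

```sql
SELECT u.id, u.username, u.language,
       (b.user_id IS NOT NULL) AS is_banned,   -- ban check via LEFT JOIN
       (SELECT COUNT(*)
          FROM private_messages pm
         WHERE pm.recipient_id = u.id
           AND pm.is_read = 0)  AS unread_pms  -- counted on demand
FROM sessions s
JOIN users u     ON u.id = s.user_id
LEFT JOIN bans b ON b.user_id = u.id
WHERE s.session_id = ?;
```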

One small "problem" is that the users table ends up with a lot of columns. Obviously I'll also need to run more checks to keep the duplicated data consistent, but isn't the performance gain worth it?

Note: My website has around 14,500 simultaneous users connected, so I need to know whether this will improve performance or do the complete opposite.

Any opinions or recommendations are welcomed.

+2  A: 

Very, very rarely is data duplication the right answer. We talk about normalization so often for a reason.

Typically you only duplicate data in an offline data-warehouse situation, where you're dealing with tens of millions of rows and the processing time for aggregation is too long. In an online system, the risk of data falling out of sync is almost always too great for any perceived gains from duplicating data. A few extra queries will not kill you.

Donnie
The online service I'm talking about is "kind of" huge: about 14,500 simultaneous users logged in. Won't the load of making extra queries turn into a big pain in the long run?
kuroir
Not if your database is properly spec'd and set up for the expected load. If you're trying to manage load by breaking your schema, I hate to say it, but you're doing it wrong.
Donnie
Yeah, I thought about it: having load balancing and multiple servers to handle this stuff, right?
kuroir
And making sure that individual servers are up to their portion of the load. MySQL setups especially seem to suffer from poorly planned disk subsystems, and that's arguably the most important part of a DB.
Donnie
I disagree (in this case). It's not a good idea to check whether the user is banned by querying another table. Normalization is an important thing, but for cases like this (posts, banned or not, PMs, etc.) I make an exception.
Bart S.
Stuff like an `IsBanned` flag wouldn't really be duplication of data, since that is a property of the user. Permissions are complicated enough that they need to be in another table, and I'd be careful about keeping things like PM counts, because there you are duplicating data.
Donnie
The above two comments make a good point. Over-normalization is sometimes worse than under-normalization, because you can make your database too hard to use. The trick is to do it *just right* :)
Donnie
Then how about moving the most used/displayed data into the users table? That would avoid querying other tables when there's no real need. Regarding the private message count: how about just keeping it as a counter, and only checking the messages themselves when the user opens the private-message panel?
kuroir
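The counter approach discussed in the comments above stays consistent only if the counter is changed in the same transaction as the underlying data. A sketch, assuming a transactional engine such as InnoDB (table and column names are assumptions):

```sql
START TRANSACTION;

-- Insert the message itself.
INSERT INTO private_messages (sender_id, recipient_id, body)
VALUES (?, ?, ?);

-- Bump the recipient's denormalized counter in the same transaction,
-- so the two can never disagree.
UPDATE users
   SET private_message_counter = private_message_counter + 1
 WHERE id = ?;  -- recipient's id

COMMIT;
```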
A: 

Is there an actual performance issue that a clever UNION statement doesn't get around?

I wouldn't overload tables to gain short-term performance. You'll sacrifice your uptime as your tables get bigger (it happened to me). You might need multiple layers of caching in your application (some combination of memcached for banned state, and materialized views for sessions + permissions, maybe?).

I suggest running tests to see what your results become after scaling the number of rows in your tables by an order of magnitude, with 100,000 simultaneous users. Your architecture might benefit from partitioning tables between current/frequent users and less frequent users, or from having servers that deal with sessions and servers that deal with canonical data.

In my project, only about 0.01% of my users are logged in at one time. If you have 1% of your users logged in, you're scaling into the million-row range. I would suggest considering how to maintain your uptime requirements and some basic performance requirements. Table repairs, optimizations, backups: these operations won't be cheap and are tricky in a multi-master architecture. (Thought about table partitioning?)

Update (and repair) operations are less expensive when performed on smaller tables. Not only are they less likely to drop large groups of cached queries out of the mysql query cache, they also maintain better key cache performance. If your users table is updated frequently, you should separate the frequently updated columns from the uncommonly updated columns. Your key cache hit rate will improve and so will your query cache hit rate.
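One way to sketch that separation (every name here is an assumption): move the volatile columns into a small side table keyed by user id, so frequent UPDATEs hit the small table and leave cached queries against users untouched.

```sql
-- Stable profile data stays in users; volatile counters move here.
CREATE TABLE user_stats (
    user_id    MEDIUMINT UNSIGNED NOT NULL,
    pm_counter MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
    last_seen  DATETIME NULL,
    PRIMARY KEY (user_id)   -- one row per user, joined to users.id
) ENGINE=InnoDB;
```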

If you're actually planning on growing this application, you'll have to deal with more queries every day no matter what. If your application suffers when the query rate merely doubles, something is wrong. In my experience, duplicating data into the users table (primary to your data model) is going to make it harder to enforce the use of other data tables, and that might be very hard to get away from.

memnoch_proxy