I have an "items" table with 1 million rows and a "users" table with 20,000 rows. When I select from the "items" table I do a join on the "users" table (items.user_id = users.id) so that I can grab the "username" from the users table.

I'm considering adding a username column to the items table and removing the join. Can I expect a decent performance increase from this? It's already quite fast, but it would be nice to decrease my load (which is pretty high).

The downside is that if the user changes their username, items will still reflect their old username, but this is okay with me if I can expect a decent performance increase.

I'm asking stackoverflow because benchmarks aren't telling me too much. Both queries finish very quickly. Regardless, I'm wondering if removing the join would lighten load on the database to any significant degree.

Example query with join:
SELECT Item.id, Item.submitter_id, Item.source_image, Item.cached_image,
       Item.source_title, Item.source_url, Item.width, Item.height, Item.status,
       Item.popular, Item.made_popular, Item.fave_count, Item.tags, Item.user_art,
       Item.nudity, Item.created, Item.modified, Item.removed, Item.nofront,
       Item.test, Item.recs, Item.recs_data,
       User.id, User.username, User.password, User.email, User.fullname,
       User.profileurl, User.homepage, User.bio, User.location, User.avatar,
       User.ff_user, User.ff_key, User.ff_last_faveid, User.twitter_user,
       User.twitter_pass, User.emailalerts, User.showunsafe, User.view,
       User.fb_uid, User.fb_session, User.fb_avatar, User.twitter_uid,
       User.twitter_data, User.twitter_autopost, User.uri, User.created, User.modified
FROM items AS Item
LEFT JOIN users AS User ON (Item.submitter_id = User.id)
WHERE Item.nofront != 1
  AND Item.removed != 1
  AND Item.made_popular IS NOT NULL
  AND nudity != 1
ORDER BY Item.made_popular DESC
LIMIT 1040, 290;

Example query without join:
SELECT Item.id, Item.submitter_id, Item.source_image, Item.cached_image,
       Item.source_title, Item.source_url, Item.width, Item.height, Item.status,
       Item.popular, Item.made_popular, Item.fave_count, Item.tags, Item.user_art,
       Item.nudity, Item.created, Item.modified, Item.removed, Item.nofront,
       Item.test, Item.recs, Item.recs_data
FROM items AS Item
WHERE Item.nofront != 1
  AND Item.removed != 1
  AND Item.made_popular IS NOT NULL
  AND nudity != 1
ORDER BY Item.made_popular DESC
LIMIT 1040, 290;

A: 

You will only see a significant performance increase if you were missing an index on either items.user_id or user.id, or if you're using a crappy database. Otherwise, performance won't improve significantly.

jemfinch
A: 

JOINs always take up more resources than simple SELECT statements, so yes, removing the JOIN should improve performance.

jfar
+1  A: 

I suggest that you keep it as it is to preserve a normalized table. I think it's not a good idea to put usernames in the items table, since it will make the data redundant. Have you tried re-indexing your table?

hallie
+4  A: 

The right answer is to measure it, in the target environment, to see if it makes a difference. Then do a cost/benefit analysis to see if it's worth it.

The cost is the added storage and possibility of data getting out of sync (but see below on how to mitigate this). The benefit is the increased speed or reduced load.

Database schemas are not set-and-forget operations, they should be tuned periodically as the underlying data changes. That's what DBAs are paid for, continual monitoring and tuning.

In any case, the duplication of a column can be controlled quite easily in a decent DBMS by the use of triggers. By that, I mean place an insert/update trigger on the users table so that, if a user changes their user name, it's also changed in the items table (and possibly vice versa).
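To illustrate, here is a minimal sketch of such a trigger. This uses SQLite from Python so it is self-contained and runnable (MySQL's CREATE TRIGGER syntax differs slightly, e.g. the client needs a DELIMITER change around the body); the table and column names are simplified from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, submitter_id INTEGER, username TEXT);

    -- Keep the denormalized items.username in sync when a user renames
    CREATE TRIGGER users_username_sync
    AFTER UPDATE OF username ON users
    FOR EACH ROW
    BEGIN
        UPDATE items SET username = NEW.username WHERE submitter_id = NEW.id;
    END;
""")
cur.execute("INSERT INTO users VALUES (1, 'alice')")
cur.execute("INSERT INTO items VALUES (10, 1, 'alice')")
cur.execute("UPDATE users SET username = 'alice_renamed' WHERE id = 1")

# The trigger has already propagated the rename to the items table
synced = cur.execute("SELECT username FROM items WHERE id = 10").fetchone()[0]
print(synced)  # -> alice_renamed
```

With the trigger in place, the denormalized column can only drift if rows are inserted into items with a stale name, so an insert trigger (or doing the lookup at insert time) covers the other direction.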

Whether MySQL meets my definition of a decent DBMS, I can't comment - I'm a DB2 bod myself. But reverting from third normal form (denormalization) is a tried and tested technique for wringing every last ounce of performance out of databases and, provided you understand the consequences, is quite acceptable. Very few people complain about their databases taking up too much disk space. Many complain about how slow their queries run.

Just keep in mind that reversion is something you do if and when you have a performance problem. It's not something that should be done just because you think it may reduce the load. Unless the load (or time taken) is actually a problem, the benefit part of your cost/benefit analysis is zero, so any sane bean counter will tell you that means "no change".


Based on your added queries, I have a couple of points to make:

  • First, the nudity column. Please tell me how I can get access to this database :-)
  • You should only extract the columns you need. If the user name is all you require from the User table, you should not be getting all that extra stuff in the first query. Possibly likewise for the Item stuff - only get what you need.
  • Make sure you have indexes on all columns used in WHERE clauses - this may also entail combination indexes (those with more than one column). What gets indexed depends on your queries but each column used in the WHERE clauses is a good start for analysis.
  • For large tables, you can consider "sweeping" removed items periodically into a separate table (e.g., RemovedItems) to minimise the size of Items and speed up the queries. But keep in mind this is only useful if you rarely need to look for removed items, since it will complicate those queries (by forcing them to search in two tables instead of one). Again, this is a cost/benefit thing. One million rows is not really that big a table (at least in my world).
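The "sweeping" idea from the last bullet amounts to a copy-then-delete committed as one unit. A toy SQLite/Python mock-up (made-up table contents, not the poster's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, removed INTEGER DEFAULT 0);
    CREATE TABLE removed_items (id INTEGER PRIMARY KEY, removed INTEGER);
    INSERT INTO items (id, removed) VALUES (1, 0), (2, 1), (3, 0), (4, 1);
""")
# Sweep: archive the removed rows, then delete them from the hot table.
# Both statements run in one transaction and commit together.
cur.execute("INSERT INTO removed_items SELECT id, removed FROM items WHERE removed = 1")
cur.execute("DELETE FROM items WHERE removed = 1")
conn.commit()

live = cur.execute("SELECT COUNT(*) FROM items").fetchone()[0]
archived = cur.execute("SELECT COUNT(*) FROM removed_items").fetchone()[0]
print(live, archived)  # -> 2 2
```

Running both statements inside a single transaction matters: if the process dies between the copy and the delete, a row would otherwise exist in both tables (or neither).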
paxdiablo
Thanks for your advice. Good point about only extracting the columns I need. I'm going through all my queries to make sure that I'm only getting what I need. I've made sure all my indexes are good. By "sweeping" do you mean deleting columns? I've heard it's better to leave them in there...
makeee
By "sweeping" he means deleting (or moving into an "archive" table) rows which are no longer active or you expect to access less often than others (partitioning.)
vladr
I don't need the "removed" rows, but I remember hearing that deleting rows can slow down lookups / cause other issues. Is this not true?
makeee
Deleting rows can cause issues if you need the data at some point, but that appears to not be the case here. The deletion process can be time consuming enough that Oracle introduced a soft-delete feature - in their implementation, the row is marked as gone but not actually physically removed from the table (same as what you're doing but as part of the DBMS itself so there's no way to get at the data). It then allows you to _physically_ delete all the logically-deleted records at a quiet time (e.g., midnight). I would suggest that you do your sweeping at a quiet time as well.
paxdiablo
paxdiablo: Thanks for the info. Items are rarely removed. Would having "removed = 0" in my query degrade performance for reasons besides the extra rows it has to pass over (which probably isn't many)? If the only issue is a few hundred extra rows that the query has to skip over it seems it's probably not worth doing any sweeping.
makeee
If you're talking hundreds of rows, don't waste your time. That size is hardly even worthy of a database :-) If your removed column is indexed, the difference between checking for NULL and checking for 0 will probably not be noticeable but again, you should measure, not guess, and certainly don't take what some random guy on the internet says as gospel :-)
paxdiablo
A: 

I have an "items" table with 1 million rows and a "users" table with 20,000 rows.

That is, independently of whether you JOIN or denormalize, you would still transfer roughly 1M/20k = 50 times more User information over the wire than strictly necessary. Encoding, transferring and then decoding data adds load.

I'm considering adding a username column to the items table and removing the join.

Why then, in your original JOIN, are you also bringing over all this other (potentially voluminous) information (such as User.profileurl, User.homepage etc.) if all you need is the user name? Remember, for User columns, you are transferring on average 50 copies of each bit of information. Have you considered drastically trimming down the columns you are SELECTing in the JOIN (from both the User and the Item tables)?
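For instance, a trimmed-down JOIN would fetch only the columns actually needed. A toy SQLite/Python demonstration (schema and data heavily simplified from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, bio TEXT, homepage TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, submitter_id INTEGER);
    INSERT INTO users VALUES (1, 'alice', 'a very long bio...', 'http://example.com');
    INSERT INTO items VALUES (10, 1);
""")
# Only the item id and the username cross the wire;
# bio, homepage, passwords, tokens etc. stay on the server.
rows = cur.execute("""
    SELECT Item.id, User.username
    FROM items AS Item
    LEFT JOIN users AS User ON Item.submitter_id = User.id
""").fetchall()
print(rows)  # -> [(10, 'alice')]
```

The query shape is identical to the original; only the SELECT list shrinks, which is exactly where the 50x duplication lives.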

I'm asking stackoverflow because benchmarks aren't telling me too much. Both queries finish very quickly. Regardless, I'm wondering if removing the join would lighten load on the database to any significant degree.

In a first phase, removing columns you do not intend to use can reduce load, as less data has to be encoded, transferred (from server to client application) then decoded.

In a second phase, let me start with a question of my own: do you really need all million rows in one shot? If you do not, e.g. if you are user-interface driven and you paginate them (using OFFSET ... LIMIT ...), then you will not necessarily care about the 50x User information duplication (unless the LIMIT gets into the tens of thousands.) Otherwise, you may want to measure the advantage of eliminating the 50x duplication by first SELECTing only User.id and User.username into application memory (20k pairs, into a hash-table/map), then SELECTing only Item rows (1M iterations), every time resolving, at the application level, Item.user_id against the hash-table/map.
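The application-level lookup described above is essentially a hash join done client-side; a minimal Python sketch with made-up data standing in for the two result sets:

```python
# Phase 1: SELECT id, username FROM users  ->  ~20k small rows, fetched once
users = [(1, "alice"), (2, "bob")]

# Phase 2: SELECT id, submitter_id, ... FROM items  ->  no User columns at all
items = [(10, 1), (11, 2), (12, 1)]

# Resolve usernames in application memory instead of in the SQL JOIN
username_by_id = dict(users)
resolved = [(item_id, username_by_id.get(uid)) for item_id, uid in items]
print(resolved)  # -> [(10, 'alice'), (11, 'bob'), (12, 'alice')]
```

The dict lookup is O(1) per item, so the cost is one extra small query plus 20k pairs of memory, in exchange for never shipping a username more than once.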

Of course, always use EXPLAIN to ensure that the proper indices exist and are actually being used, and run ANALYZE TABLE after any of your tables grows from a few hundred rows to thousands or millions.
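As a sanity check of that kind, here is what index verification looks like with SQLite's EXPLAIN QUERY PLAN from Python (MySQL's EXPLAIN output is tabular but serves the same purpose; schema simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, made_popular TEXT, nudity INTEGER);
    CREATE INDEX idx_items_made_popular ON items (made_popular);
""")
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM items WHERE made_popular = '2010-01-01'"
).fetchall()
print(plan)  # plan rows should mention the index, not a full table scan
```

If the plan reports a table SCAN instead of a SEARCH using the index, the index is missing or the query is written in a way the optimizer cannot use it.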

Cheers, V.

vladr
You are correct about there being extra user info. I didn't realize that was such a big deal, but it makes sense in retrospect. I'll start with trimming that down.
makeee