In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if it is related through other tables.

Should I denormalize the user_id and include it in every table, for faster performance?


Here's one example:

  • Address belongs to user, and has a user_id
  • Envelope belongs to user, and has a user_id
  • AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).

One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
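
To make this concrete, here is roughly what the two variants of that query would look like (assuming Rails-convention table names; the user_id column on addresses_envelopes is the hypothetical denormalized one):

-- Current approach: join through addresses just to filter by user
SELECT ae.*
FROM addresses_envelopes ae
JOIN addresses a ON a.id = ae.address_id
WHERE a.user_id = 42;

-- With a duplicated user_id on the join table, no join is needed
-- and an index on addresses_envelopes.user_id can be used directly
SELECT ae.*
FROM addresses_envelopes ae
WHERE ae.user_id = 42;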


Here's a different scenario:

  • Letter belongs to user, and has a user_id
  • Recepient belongs to Letter, and has a letter_id
  • RecepientOption belongs to Recepient, and has a recepient_id

Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
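
For comparison, filtering RecepientOption rows by user currently means walking the whole chain, whereas a duplicated user_id would reduce it to a single predicate (the table names below just follow the model names and are assumed):

-- Normalized: climb up through recepients and letters to reach user_id
SELECT ro.*
FROM recepient_options ro
JOIN recepients r ON r.id = ro.recepient_id
JOIN letters l ON l.id = r.letter_id
WHERE l.user_id = 42;

-- Denormalized: user_id duplicated onto recepient_options itself
SELECT ro.*
FROM recepient_options ro
WHERE ro.user_id = 42;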


Some notes:

  • There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
  • The user owner of objects never changes.
  • Database performance is important because it's a data-intensive application. There are many queries and many tables.


So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?

+1  A: 

As long as you

a) get a measurable performance improvement

and

b) know which parts of your database are real normalized data and which are redundant improvements

there is no reason not to do it!

David Schmitt
Cool! Good to hear there's nothing glaringly wrong about it. Thanks.
Allan Grant
+1  A: 

Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.

I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
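
For example, indexes along these lines on the foreign key and user_id columns are usually the first thing to try (table and index names are just illustrative):

CREATE INDEX index_addresses_on_user_id ON addresses (user_id);
CREATE INDEX index_envelopes_on_user_id ON envelopes (user_id);
CREATE INDEX index_ae_on_address_id ON addresses_envelopes (address_id);
CREATE INDEX index_ae_on_envelope_id ON addresses_envelopes (envelope_id);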

The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
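
One way to do that at the database level is a trigger that copies the owner from the parent row on insert; this is only a sketch assuming MySQL and the addresses_envelopes example from the question (a Rails callback would work just as well):

-- Hypothetical trigger: fill in the denormalized user_id automatically
CREATE TRIGGER addresses_envelopes_set_user_id
BEFORE INSERT ON addresses_envelopes
FOR EACH ROW
SET NEW.user_id = (SELECT user_id FROM addresses WHERE id = NEW.address_id);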

Juha Syrjälä
I second that. 500,000 isn't that much. Do you have an idea how fast things should be, and by what amount you want to increase performance?
Roland Bouman
Note, it's 500,000 records per user, not 500,000 records total. The total number of users should be able to grow to at least 100,000 without scaling problems, although the number of concurrent users will likely be much smaller (less than 1% of the total). So with 100K active users and 500K records per user, that's 50,000,000,000 records. That's why I think partitioning by user_id may be helpful eventually. There is no measured performance problem yet. I'm just wondering if it would be a good move _hypothetically_ to partition every table with user_id and use it as the first item in each composite index.
Allan Grant
+2  A: 

I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Here's a sample for the AddressesEnvelopes case:

user(
    #user_id
)
address(
    #user_id
,   #address_num
)
envelope(
    #user_id
,   #envelope_num
)
address_envelope(
    #user_id
,   #address_num
,   #envelope_num
)

(the # indicates a primary key column)

I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, by putting ranges of users in separate tables, or physically, by using multiple databases or even machines).

Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure that user_id is always the first column in your index, all of a user's data will be stored close together on disk for each table. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
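
As a sketch of what that looks like in MySQL DDL (column types and names are assumed, following the sample above):

CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    PRIMARY KEY (user_id, address_num, envelope_num)
) ENGINE=InnoDB;
-- InnoDB clusters rows by primary key, so with user_id first,
-- all of a user's rows end up stored next to each other on disk.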

At any rate, before you change the design, first make sure your schema is already optimized and that you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and run benchmarks.

Roland Bouman
Thanks, Roland. That's exactly what I was thinking of doing. Perhaps denormalize was the wrong word to use, since I'm not actually denormalizing data from within the user table into other tables, just including the user_id as a key in a table where it's possible to get at the user_id through another key (like in the address_envelopes example, where you could get the user_id through either the address or the envelope). Clustered indexing and partitioning data between tables/machines by user sounds like a great idea!
Allan Grant