In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if it is related through other tables.

Should I denormalize the user_id and include it in every table, for faster performance?


Here's one example:

  • Address belongs to user, and has a user_id
  • Envelope belongs to user, and has a user_id
  • AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).

One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
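
To make this concrete, here is roughly what the two variants of that query would look like (assuming Rails-convention table names; the user_id column on addresses_envelopes is the hypothetical denormalized one):

-- Current approach: join through addresses just to filter by user
SELECT ae.*
FROM addresses_envelopes ae
JOIN addresses a ON a.id = ae.address_id
WHERE a.user_id = 42;

-- With a duplicated user_id on the join table, no join is needed
-- and an index on addresses_envelopes.user_id can be used directly
SELECT ae.*
FROM addresses_envelopes ae
WHERE ae.user_id = 42;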


Here's a different scenario:

  • Letter belongs to user, and has a user_id
  • Recepient belongs to Letter, and has a letter_id
  • RecepientOption belongs to Recepient, and has a recepient_id

Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
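
For comparison, filtering RecepientOption rows by user currently means walking the whole chain, whereas a duplicated user_id would reduce it to a single predicate (the table names below just follow the model names and are assumed):

-- Normalized: climb up through recepients and letters to reach user_id
SELECT ro.*
FROM recepient_options ro
JOIN recepients r ON r.id = ro.recepient_id
JOIN letters l ON l.id = r.letter_id
WHERE l.user_id = 42;

-- Denormalized: user_id duplicated onto recepient_options itself
SELECT ro.*
FROM recepient_options ro
WHERE ro.user_id = 42;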


Some notes:

  • There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
  • The user owner of objects never changes.
  • Database performance is important because it's a data-intensive application. There are many queries and many tables.


So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?

+1  A: 

As long as you

a) get a measurable performance improvement

and

b) know which parts of your database are real normalized data and which are redundant improvements

there is no reason not to do it!

David Schmitt
Cool! Good to hear there's nothing glaringly wrong about it. Thanks.
Allan Grant
+1  A: 

Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.

I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
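
For example, indexes along these lines on the foreign key and user_id columns are usually the first thing to try (table and index names are just illustrative):

CREATE INDEX index_addresses_on_user_id ON addresses (user_id);
CREATE INDEX index_envelopes_on_user_id ON envelopes (user_id);
CREATE INDEX index_ae_on_address_id ON addresses_envelopes (address_id);
CREATE INDEX index_ae_on_envelope_id ON addresses_envelopes (envelope_id);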

The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
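
One way to do that at the database level is a trigger that copies the owner from the parent row on insert; this is only a sketch assuming MySQL and the addresses_envelopes example from the question (a Rails callback would work just as well):

-- Hypothetical trigger: fill in the denormalized user_id automatically
CREATE TRIGGER addresses_envelopes_set_user_id
BEFORE INSERT ON addresses_envelopes
FOR EACH ROW
SET NEW.user_id = (SELECT user_id FROM addresses WHERE id = NEW.address_id);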

Juha Syrjälä
I second that. 500,000 isn't that much. Do you have an idea how fast things should be, and by what amount you want to increase performance?
Roland Bouman
Note, it's 500,000 records per user, not 500,000 records total. The total number of users should be able to grow to at least 100,000 without scaling problems, although the number of concurrent users will likely be much smaller (less than 1% of the total). So with 100K active users and 500K records per user, that's 50,000,000,000 records. That's why I think partitioning by user_id may be helpful eventually. There is no measured performance problem yet. I'm just wondering if it would be a good move _hypothetically_ to partition every table with user_id and use it as the first item in each composite index.
Allan Grant
+2  A: 

I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Here's a sample for the AddressesEnvelopes case:

user(
    #user_id
)
address(
    #user_id
,   #address_num
)
envelope(
    #user_id
,   #envelope_num
)
address_envelope(
    #user_id
,   #address_num
,   #envelope_num
)

(the # indicates a primary key column)

I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, by putting ranges of users in separate tables, or physically, by using multiple databases or even machines).

Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure that user_id is always the first column in your index, all of a user's data will be stored close together on disk for each table. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
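
As a sketch of what that looks like in MySQL DDL (column types and names are assumed, following the sample above):

CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    PRIMARY KEY (user_id, address_num, envelope_num)
) ENGINE=InnoDB;
-- InnoDB clusters rows by primary key, so with user_id first,
-- all of a user's rows end up stored next to each other on disk.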

At any rate, before you change the design, first make sure your schema is already optimized and that you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and run benchmarks.

Roland Bouman
Thanks, Roland. That's exactly what I was thinking of doing. Perhaps denormalize was the wrong word to use, since I'm not actually denormalizing data from within the user table into other tables, just including the user_id as a key in a table where it's possible to get at the user_id through another key (like in the address_envelopes example, where you could get the user_id through either the address or the envelope). Clustered indexing and partitioning data between tables/machines by user sounds like a great idea!
Allan Grant