views:

85

answers:

2

Assume we have a popular site. We need to implement mail-like messaging between users. Typical solution is to use 2 tables:

Users (user_id)

Messages (message_id, sender_id (references user_id), receiver_id (references user_id), subject, body ).

This method has 2 significant limitations

  1. All messages of all users are stored in one table leading to it's high load and decreasing overall database performance.
  2. When someone needs to send message to several users simultaneously, the message gets copied (recipients_count) times.

The other solution uses 3 tables:

Users(user_id)

Sent_messages(sent_id, sender_id (references user_id), subject, body)

Received_messages(sent_id, receiver_id (references user_id), subject, body)

subject and body of received_messages are copied from corresponding fields of sent_messages.

This method leads to

  1. Denormalizing the database by copying information from one table to another
  2. Users can actually delete sent/received messages without removing them from the receivers/senders.
  3. Messages take approximately 2 times more space
  4. Each table is loaded approximately 2 times less.

So here go the questions:

  1. Which one of considered design is better for high load and scalability? (I think it's the second one)
  2. Is there another database design that can handle high load? What is it? What are the limitations?

Thanks!

P.S. I understand that before getting to these scalability issues the site has to be very successful, but I want to know what to do if I need to.

UPDATE

Currently for the first versions I'll be using design proposed by Daniel Vassallo. But if everything is OK in the future, the design will be changed to the second one. Thanks to Evert for allaying my apprehension about it.

+1  A: 

In general database size will not be much of a concern. Speed is much more important.

Therefore, I would be tempted to go for option two. Just like you mentioned, it makes things like deleting messages a lot easier, and I'm pretty sure this is a very common way to do this.

Evert
+2  A: 

You may want to avoid copying the message body multiple times in the case where a message is sent to multiple recipients. Here is another option which you may want to consider:

  • users (user_id)

  • messages (message_id, sender_id, subject, body)

  • received_messages (message_id, user_id, address_mode, deleted)

This model may be more twitter-like than email-like, but it may come with some advantages.

The rules are that:

  • A message can only be sent by one user, referenced in the sender_id of each message.
  • Each recipient will be defined in the received_messages table. The address_mode field can define whether a message was sent to the recipient directly, or as a CC, or maybe as BCC. This field is obviously optional.
  • Deleted messages by recipients will mark the deleted flag in the received_messages table.
  • Forwarded and replied-to messages need to be recreated with a new sender_id. The message body can be then modified.

These are some of the advantages:

  • This takes less space than the two options mentioned in the original question, especially if users will typically send messages to multiple recipients.
  • Easier caching of the messages table, since messages are never duplicated.
  • The recipient deleting a message will not erase the information that the message was sent to this user. It will simply be marked as 'deleted' in the received_messages table.
  • And you also get a normalized model.

For most applications, if you use an optimistic isolation level with the above model, you should not have performance problems even if you are expecting messages to be exchanged at a rate of a few per second. If on the other hand you're expecting hundreds or thousands of messages per second, then it may really be the case to consider other options.

Daniel Vassallo
Thanks! I considered this design too. Actually after posting this question I (mentally :) ) modified the first design toUsers (user_id)Messages (message_id, sender_id , receiver_id , message_content_id )Message_content (message_content_id, subject, body)So both these designs are better than the first one (in space taken and table performance).Maybe the question statement had to be something like: is it better to store all messages in one table, or to copy them in two? How significant will the impact on performance be?
Sergii Vozniuk