I'm designing a very simple (in terms of functionality) but difficult (in terms of scalability) system where users can message each other. Think of it as a very simple chat service. A user can insert a message through a PHP page. The message is short and has a recipient name.

On another PHP page, the user can view all the messages that were sent to him at once, after which they are deleted from the database. That's it. That's all the functionality needed for this system. How should I go about designing this (from a database/PHP point of view)?

So far I have the table like this:

  • field1 -> message (varchar)
  • field2 -> recipient (varchar)
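
In SQL, a minimal sketch of that table (the name, column sizes, and the index mentioned below are assumptions) might be:

-- minimal sketch; table name, sizes, and index name are placeholders
CREATE TABLE messages (
    message   VARCHAR(255) NOT NULL,
    recipient VARCHAR(64)  NOT NULL,
    INDEX idx_recipient (recipient)   -- the index mentioned below
);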

Now, for the SQL insert, I find that the time it takes is constant regardless of the number of rows in the database, so my send.php has a guaranteed return time, which is good.

But for pulling down messages, my pull.php takes longer as the number of rows increases! I find the SQL select (and delete) takes longer as the table grows, and this is true even after I have added an index on the recipient field.

Now, if it were simply the case that users have to wait longer before their messages are pulled down, it would be OK. But what I am worried about is that when each pull.php request takes a really long time, the PHP server will start refusing connections to some requests, or worse, the server might just die.

So the question is, how to design this such that it scales? Any tips/hints?

PS. Some estimates on the numbers:

  • the number of users starts at 50,000 and goes up.
  • each user on average has around 10 messages stored before the other end pulls them down.
  • each user sends around 10-20 messages a day.


UPDATE from reading the answers so far:

I just want to clarify that pulling down fewer messages in pull.php does not help. Even pulling just one message takes a long time when the table is huge, because the table holds all the messages, so you have to do a select like this:

select message from DB where recipient = 'John'

and even if you change it to this, it doesn't help much:

select top 1 message from DB where recipient = 'John'
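
(If this is MySQL, note that it has no TOP keyword; the equivalent is LIMIT:)

-- MySQL equivalent of the TOP 1 query above
select message from DB where recipient = 'John' limit 1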

So far, from the answers, it seems the larger the table, the slower the select will be, O(n) or slightly better, with no way around it. If that is the case, how should I handle this from the PHP side? I don't want the PHP page to fail at the HTTP level, because the user will be confused and end up refreshing like mad, which makes it even worse.

+1  A: 

So the question is, how to design this such that it scales? Any tips/hints?

Yes, you don't want to use a relational database for message queuing. What you are trying to do is not what a relational database is best designed for, and while you can do it, it's kind of like driving in a nail with a screwdriver.

Instead, look at one of the many open-source message queues out there; the guys at Second Life have a neat wiki where they reviewed a lot of them:

http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes

FlySwat
+3  A: 

The database design for this is simple, as you suggest. As for queries taking longer once a user has more messages, what you can do is paginate the results: show the first 10/50/100 or whatever makes sense, and only pull those records. Generally speaking, your times shouldn't increase very much unless the volume of messages grows by an order of magnitude or more. You should be able to pull back 1,000 short messages in well under a second. The page may take longer to display at that point, but that's where the pagination helps.
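
A rough sketch of such a paginated pull in MySQL, assuming a messages table with an auto-incrementing id column (both names are assumptions):

-- pull one page of 50 messages for a recipient;
-- id is an assumed auto-incrementing column used for stable ordering
SELECT id, message
FROM messages
WHERE recipient = 'John'
ORDER BY id
LIMIT 50 OFFSET 0;   -- next page: OFFSET 50, then OFFSET 100, ...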

I would suggest, though, thinking through future features and building your database out a little more based on that. Adding features to the software is easy; changing the database is comparatively harder.

Ryan Guill
I wish I knew why everyone on this question is getting mass down-voted. There may be some controversial advice here, but I don't believe any of it is inaccurate. If you are down-voting, please explain why, especially if you are the one asking the question. Let us clarify if necessary.
Ryan Guill
I agree, I'm upvoting this as well. This is a legitimate question for Stack Overflow and it should get the chance to have a good answer.
St. John Johnson
I agree with both of your sentiments. I don't have the points to down-vote, but if I ever do, I am making a mental note to be sure to leave a comment explaining why.
thinkhard
A: 

You could always have only one row per user and just concatenate messages together into one long record. If you're keeping messages for a long period of time, that isn't the best way to go, but it reduces your problem to a single find-and-concatenate at storage time and a single find at retrieve time. It's hard to say without more detail; part of what makes DB design hard is meeting all the goals of the system in a well-compromised way, and without all the details it's hard to advise on the best compromise.
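
A minimal sketch of that idea in MySQL, assuming an inbox table keyed by recipient (all names are hypothetical):

-- one row per user; all pending messages concatenated into one column
CREATE TABLE inbox (
    recipient VARCHAR(64) PRIMARY KEY,
    messages  TEXT NOT NULL
);

-- storage: a single indexed find-and-concatenate
-- (assumes the row exists; otherwise use INSERT ... ON DUPLICATE KEY UPDATE)
UPDATE inbox
SET messages = CONCAT(messages, '\n', 'hello John')
WHERE recipient = 'John';

-- retrieval: a single indexed find, then blank the queue
SELECT messages FROM inbox WHERE recipient = 'John';
UPDATE inbox SET messages = '' WHERE recipient = 'John';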

EDIT: I thought I was fairly clear on this, but evidently not: you would not do this unless you were blanking a reader's queue when he reads it. This is why I prompted for clarification.

The only criticism I have to offer is that you should build a test case and see exactly how the performance differs. I think you have a fundamental misunderstanding of how good databases are at what they're designed to do.
Frank Crook
You should not add repeating groups to columns. This is a poor design decision.
thinkhard
A: 

Limit the number of rows that your pull.php will display at any one time.

The more data you transfer, the longer it will take to display the page, regardless of how great your DB is.

You must limit your data in the SQL; return only the most recent N rows.

EDIT: Put an index on recipient and it will speed things up. You'll need another column to distinguish rows if you want to take the top 50 or so, possibly a SendDate or an auto-incrementing field. A clustered index will slow down inserts, so use a regular index there.
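
In MySQL terms that might look like the following (messages, senddate, and the index name are all assumed):

-- regular (non-clustered) index covering both the lookup and the ordering
CREATE INDEX idx_recipient_senddate ON messages (recipient, senddate);

-- return only the most recent 50 messages for a user
SELECT message
FROM messages
WHERE recipient = 'John'
ORDER BY senddate DESC
LIMIT 50;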

KM
"A Clustered index will slow down inserts, so use a regular index there." Don't do this. http://www.sql-server-performance.com/tips/clustered_indexes_p1.aspx
DBAndrew
Hello? He is using MySQL, not SQL Server.
KM
A: 

This is an unavoidable problem: more messages, more time to find the requested ones. The only thing you can do is what you already did: add an index, turning the O(n) lookup time of a complete table scan into O(log(u) + m) for a clustered index lookup, where n is the total number of messages, u the number of users, and m the number of messages per user.

Daniel Brückner
+3  A: 
  1. Follow the rules of normalization and try to reach third normal form; going further for this type of application probably isn't worth it. Keep your tables thin.
  2. Don't actually delete rows; just mark them as deleted with a bit flag. If you really need to remove them to reduce size, create a cleanup process that archives or removes the flagged records during low-usage hours (a sketch follows this list).
  3. Integer values are easier for the SQL server to deal with than character values, so instead of WHERE recipient = 'John' use WHERE recipient_id = 23. You get this for free when you normalize your database.
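
A sketch of points 2 and 3 together, with assumed table and column names:

-- point 3: integer key instead of a varchar comparison
SELECT message FROM message WHERE recipient_id = 23 AND is_deleted = 0;

-- point 2: mark as deleted rather than deleting
UPDATE message SET is_deleted = 1 WHERE recipient_id = 23;

-- cleanup process run during low-usage hours: archive or remove flagged rows
DELETE FROM message WHERE is_deleted = 1;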
DBAndrew
Adding joins in this situation would only exacerbate his problem. I'm a huge believer in third normal form, but if he is already seeing performance problems on a single-table query with proper indexes, normalizing won't help.
FlySwat
+3  A: 

Don't use a VARCHAR for your recipient. It's best to make a recipient table with a primary key that is an integer (or BIGINT if you are expecting an extremely large number of people).

Then when you do your select statements:

SELECT message FROM DB WHERE recipient = 52;

Retrieving rows will be much faster.
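
A minimal sketch of that two-table layout (all names and sizes are assumptions):

-- lookup table mapping names to integer keys
CREATE TABLE recipient (
    id   INT AUTO_INCREMENT PRIMARY KEY,  -- BIGINT for very large user counts
    name VARCHAR(64) NOT NULL UNIQUE
);

-- messages store the integer key, matching the query above
CREATE TABLE message (
    recipient INT NOT NULL,          -- id from the recipient table
    message   VARCHAR(255) NOT NULL,
    INDEX idx_recipient (recipient)
);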

Plus, I believe MySQL indexes are B-trees, which gives O(log n) lookups in most cases.

St. John Johnson
+2  A: 

A database table without an index is called a heap. Querying a heap results in each row of the table being evaluated, even with a WHERE clause, so the big-O complexity of a heap lookup is O(n), n being the number of rows in the table. Adding an index (the details depend on the underlying database engine) reduces the complexity of finding a matching row to O(log(n)), because the index is almost certainly implemented as some kind of B-tree. Appending rows to the heap itself remains an O(1) operation even with an index present, plus a small O(log(n)) cost to maintain the index.
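
One way to verify which case a query falls into is MySQL's EXPLAIN (messages and idx_recipient are assumed names):

-- "type: ref" with "key: idx_recipient" means the O(log n) index path;
-- "type: ALL" means the O(n) full heap scan
EXPLAIN SELECT message FROM messages WHERE recipient = 'John';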

 > But for pulling down messages, my pull.php takes longer as the number of rows increases! I find the SQL select (and delete) takes longer as the table grows, and this is true even after I have added an index on the recipient field.

UNLESS you are inserting into the middle of an index, at which point the database engine needs to shift rows to accommodate the new one; the same occurs when you delete from the index. Remember there is more than one kind of index: be sure the index you are using is not a clustered index, as more data must be sifted through and moved on inserts and deletes.

FlySwat has given the best option available to you... do not use an RDBMS because your messages are not relational in a formal sense. You will get much better performance from a file system.

dbarker has also given correct answers. I do not know why he has been voted down 3 times, but I will vote him up at the risk that I may lose points. dbarker is referring to "Vertical Partitioning" and his suggestion is both acceptable and good. This isn't rocket surgery, people.

My suggestion is not to implement this kind of functionality in your RDBMS; if you do, remember that SELECT, UPDATE, INSERT, and DELETE all place locks on pages in your table. If you do go forward with putting this functionality into a database, run your selects with a NOLOCK locking hint, if it is available on your platform, to increase concurrency. Additionally, if you have that many concurrent users, partition your tables vertically as dbarker suggested and place the database files on separate drives (not just separate volumes but separate hardware) to increase I/O concurrency.
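
For example, on SQL Server the hint is written as WITH (NOLOCK); MySQL has no direct equivalent, but the closest analogue is lowering the session isolation level (messages is an assumed table name):

-- SQL Server: read without taking shared locks
SELECT message FROM messages WITH (NOLOCK) WHERE recipient = 'John';

-- rough MySQL analogue: dirty reads for the current session
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT message FROM messages WHERE recipient = 'John';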

thinkhard