tags:

views:

105

answers:

8

I have a site with about 30,000 members to which I'm adding a functionality that involves sending a random message from a pool of 40 possible messages. Members can never receive the same message twice.

One table contains the 40 messages and another table maps the many-to-many relationship between messages and members.

A cron script runs daily, selects a member from the 30,000, selects a message from the 40 and then checks to see if this message has been sent to this user before. If not, it sends the message. If yes, it runs the query again until it finds a message that has not yet been received by this member.

What I'm worried about now is that this m-m table will become very big: at 30,000 members and 40 messages we already have 1.2 million rows through which we have to search to find a message that has not yet been sent.

Is this a case for denormalisation? In the members table I could add 40 columns (message_1, message_2 ... message_40) in which a 1 flag is added each time a message is sent. If I'm not mistaken, this would make the queries in the cron script run much faster

?

A: 

Normalization reduces redundancy and it is what you'll do if you have large amount of data which seems to be your case. You need not denormalize. Let there be an M-to-M table between members and messages.

You can archive the old data as your M-to-M data increases. I don't even see any conflicts because your cron job runs daily for this task and accounts only for the data for the current day. So you can archive M-to-M table data every week.

I believe there will be maintenance issue if you denormalize by adding additional coloumns to members table. I don't recommend the same. Archiving of old data can save you from trouble.

this. __curious_geek
With archiving old data, you mean deleting records for users that have received all 40 messages?
stef
Shifting the data into another database which is not accessed so frequently so that your main application works well with M-2-M table having only required data and still retain the old data in archived table.
this. __curious_geek
+1  A: 

You could also append the id of the sent messages to a varchar-field in the members-table. Despite of good manners, this would make it easily possible to use one statement to get a message that has not been sent yet for a specific member.

Just like this (if you surround the ids with '-')

SELECT message.id
FROM member, message
WHERE member.id = 2321
AND member.sentmessages NOT LIKE '%-' && id && '-%'
Mulmoth
+4  A: 

I know that doesn't answer your original question, but wouldn't it be way faster if you selected all the messages that weren't yet sent to a user and then select one of those randomly?

See this pseudo-mysql here:

SELECT 
    CONCAT_WS(',', messages.ids) unsent_messages, 
    user.id user
FROM
    messages,
    user
WHERE
    messages.id NOT IN (
     SELECT 
      id 
     FROM 
      sent_messages 
     WHERE 
      user.id = sent_messages.user
    )
GROUP BY ids
André Hoffmann
I agree, I've elaborated on this a bit below
Steve De Caux
Could you clarify what you are doing in the first select statement with CONCAT_WS ?
stef
Well I pretty much concatenate all the ids of messages that are unsent together using a comma separator. This allows to return a list of ids for every user in one SELECT.
André Hoffmann
A: 

You can achieve the effect of sending random messages by preallocating the random string in your m-m table and a pointer to the offset of the last message sent.

In more detail, create a table MemberMessages with columns
memberId,
messageIdList char(80) or varchar ,
lastMessage int,
primary key is memberId.

Pseudo-code for the cron job then looks like this...

ONE. Select next message for a member. If no row exists in MemberMessages for this member, go to step TWO. The sql to select next message looks like

select substr(messageIdList, 2*lastMessage + 1, 2) as nextMessageId  
from MemberMessages  
where member_id = ?

send the message identified by nextMessageId

then update lastMessage incrementing by 1, unless you have reached 39 in which case reset it to zero.

update MemberMessages  
set lastMessage = MOD(lastMessage + 1, 40)  
where member_id = ?

TWO. Create a random list of messageIds as a String of couplets like 2117390740... This is your random list of message IDs as an 80 char String. Insert a row to MemberMessages for your member_id setting message_id_list to your 80 char String and set last_message to 1.

Send the message identified by the first couplet from the list to the member.

Steve De Caux
A: 

You could store only available (unsent) messages. This implies extra maintenance when you add or remove members or message types (nothing that can't be automated with foreign keys and triggers) but simplifies delivery: pick a random line from each user, send the message and remove the line. Also, your database will get smaller as messages get sent ;-)

Álvaro G. Vicario
+1  A: 

1.2 M rows @ 8 bytes (+ overhead) per row is not a lot. It's so small I wouldn't even bet it needs indexing (but of course you should do it).

erikkallen
A: 

You can create a kind of queue / heap.

ReceivedMessages

UserId
MessageId

then:

Pick up a member and select message to send:

SELECT * FROM Messages WHERE MessageId NOT IN (SELECT MessageId FROM ReceivedMessages WHERE UserId = @UserId) LIMIT 1

then insert MessageId and UserId to ReceivedMessages

and do send logic here

I hope that helps.

pablox
A: 

There are potential easier ways to do this, depending on how random you want "random" to be.

Consider that at the beginning of the day you shuffle an array A, [0..39] which describes the order of the messages to be sent to users today.

Also, consider that you have at most 40 Cron jobs, which are used to send messages to the users. Given the Nth cron job, and ID the selected user ID, numeric, you can choose M, the index of the message to send:

M = (A[N] + ID) % 40.

This way, a given ID would not receive the same message twice in the same day (because A[N] would be different), and two randomly selected users have a 1/40 chance of receiving the same message. If you want more "randomness" you can potentially use multiple arrays.

Vladski