I am trying to find out the difficulty of implementing a queue system. I know how to implement a basic queue, so I'll explain a little about what I'm after, with some background:

I will be implementing a queue that messages get placed on. The messages will come from several users and will be scheduled to be posted at user-defined times (multiple occurrences are allowed, with a precision of minutes; from a UI perspective I will be restricting "every minute" or "every hour" occurrences, but I'd like the system to still be able to handle them).

Here is where my question comes in: eventually I may be in a situation (and maybe not) where MANY messages need to be posted at the current time. I'd like to have several processes (multiple instances of a script) running that each fetch a batch of [x, 10, 25] messages from the queue at a time and process them. The problem is: how do I do this so that each instance processes unique messages (without processing something that is already being processed by another instance)? I'm worried about concurrent connections, how to lock records, and anything else I may not be thinking about.

The technologies I will be using are PHP and MySQL. I am looking for solutions to the above, terms I should be using in my searches, real-world examples, thoughts, comments and ideas.

Thank you all!

One solution I came across was using Amazon Simple Queue Service ... it promises unique message processing/locking: http://aws.amazon.com/sqs/

+4  A: 

Well, I'd do it like this:

Make your table for messages and add two more fields - "PROCESS_ID" and "PROCESS_TIME". These will be explained later.
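A rough schema sketch along those lines (a minimal example only; the scheduled_at column, names and types are my own assumptions, so adapt them to your actual message table):

    -- Example message table: scheduled_at is the user-defined post time,
    -- process_id/process_time implement the "soft lock" described below.
    CREATE TABLE messages (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id      INT UNSIGNED NOT NULL,
        body         TEXT NOT NULL,
        scheduled_at DATETIME NOT NULL,
        process_id   VARCHAR(64) NULL DEFAULT NULL,
        process_time DATETIME NULL DEFAULT NULL,
        KEY idx_claim (process_id, scheduled_at)
    ) ENGINE=InnoDB;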

Give each process a unique ID. They can generate it at the startup (like a GUID), or you can assign them yourself (then you can tell them apart more easily).

When a process wants to fetch a bunch of messages, it then does something like this:

  1. UPDATE messages SET process_id=$id, process_time=now() where process_id is null LIMIT 20
  2. SELECT * FROM messages WHERE process_id=$id

This will find 20 "free" messages and "lock" them. Then it will find the messages that it locked and process them. After each message is processed, DELETE it.
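A minimal PHP sketch of such a worker, assuming a PDO connection and the example table above (the 20-row batch size, the scheduled_at filter, the column names and the uniqid()-based worker ID are all just examples):

    <?php
    // Minimal worker sketch: claim a batch, process it, delete each message.
    // Assumes the messages table sketched above; adjust names to your schema.
    $pdo = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Give this worker instance a unique ID (could also be assigned externally).
    $processId = uniqid('worker_', true);

    // 1. "Lock" up to 20 free, due messages in a single atomic UPDATE.
    $claim = $pdo->prepare(
        'UPDATE messages
            SET process_id = :pid, process_time = NOW()
          WHERE process_id IS NULL AND scheduled_at <= NOW()
          LIMIT 20'
    );
    $claim->execute(array(':pid' => $processId));

    // 2. Fetch only the rows this worker just claimed.
    $fetch = $pdo->prepare('SELECT * FROM messages WHERE process_id = :pid');
    $fetch->execute(array(':pid' => $processId));

    $delete = $pdo->prepare('DELETE FROM messages WHERE id = :id');

    foreach ($fetch->fetchAll(PDO::FETCH_ASSOC) as $message) {
        // ... post $message to the third-party API here ...

        // 3. Remove the message once it has been processed successfully.
        $delete->execute(array(':id' => $message['id']));
    }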

The UPDATE statement should be pretty atomic, especially if you use InnoDB, which wraps each such statement in a transaction automatically. MySQL should take care of all the concurrency there.

The PROCESS_TIME field is optional, but you can use it to see when a process has hung. If a message is locked for too long, you can conclude that something went wrong and investigate.
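For example, a periodic cleanup query could release anything that has been locked for too long (the 10-minute threshold here is arbitrary):

    -- Reclaim messages whose worker apparently died; they become claimable again.
    UPDATE messages
       SET process_id = NULL, process_time = NULL
     WHERE process_id IS NOT NULL
       AND process_time < NOW() - INTERVAL 10 MINUTE;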

Vilx-
Just so I'm understanding you correctly: MySQL should handle each UPDATE in the order they were received, and it also sounds like if two updates are received at the same time, MySQL should handle that internally itself. Does anyone have a documentation link on this?
farinspace
Look in the MySQL documentation under the section about InnoDB. Pay attention to the part about transactions. However, I think (but I'm not sure) that this might work even under MyISAM, which has no transaction support. The idea is that even if two UPDATE statements can run at the same time, they should definitely NOT be able to access the same row at the same time. If they could, this would create major chaos. But if they can't, then you're already safe. You cannot predict exactly which rows the UPDATE will affect, but you know that no two UPDATEs will conflict - which is enough in your case.
Vilx-
I guess with this database pattern you will be polling the queue for messages a lot?
Alfred
How else would you imagine it? Shouldn't be that bad though - MySQL should be able to keep this table in RAM completely. If it gets too large, then you have a more serious problem already...
Vilx-
@Vilx, I believe the query cache will be flushed on inserting data. Furthermore, how does process P1 know when process P2 inserts a message into the queue? Will process P1 poll the table? I think the table will be blocked a lot when many concurrent users run your MySQL statements. It's a pretty sweet solution if only a couple of users are using it, but I think it will run slow otherwise.
Alfred
I don't mean query cache. I mean actual table data. The table would be nearly equivalent to a MEMORY table, except that it would get flushed to disk from time to time. Well, of course, I don't know how MySQL handles caching. As for lots of blocking - I guess the only way to find out is to try it. If the polling happens, say, once every second, then there should be plenty of time for other processes to insert data. Plus, InnoDB locks rows, not whole tables. And with an appropriate transaction isolation level, I think that the performance should be pretty much OK. But, well - who knows. :)
Vilx-
A: 

Hi Farinspace,

A couple of online solutions:

  1. Amazon SQS.
  2. Google App Engine's queue system

I guess the Google solution is much cheaper (it could even be free if you're not using it much).

I have also been thinking about implementing a queue in PHP/MySQL and thought of using:

  1. MySQL's GET_LOCK() to implement some sort of lock (see the sketch after this list).
  2. Putting the queue in MySQL's MEMORY (heap) storage engine, because an in-memory queue is much faster than an on-disk queue. But you have the risk of losing data when the computer crashes.
  3. Using named pipes to communicate with processes.
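A rough sketch of point 1, using GET_LOCK()/RELEASE_LOCK() around the claim step (the lock name and 10-second timeout are arbitrary, and note this serialises claiming across all workers):

    <?php
    // Sketch: serialise the "claim a batch" step with a MySQL named lock.
    $pdo = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');

    $got = $pdo->query("SELECT GET_LOCK('message_queue_claim', 10)")->fetchColumn();

    if ($got == 1) {
        // Claim and mark a batch of messages here while holding the lock...

        $pdo->query("SELECT RELEASE_LOCK('message_queue_claim')");
    }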
Alfred
+1  A: 

You could turn the problem around.

Instead of having the problem of getting things out of the queue at the same time, publish all the info as soon as you get it, but publish it with a rule that it is not supposed to be visible until a certain time. Doing things this way could help you avoid locking/contention problems.
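For example, with a visible_at column (a hypothetical name; it plays the same role as the user-defined scheduled time), producers insert straight away and consumers simply never select rows that are not yet due:

    -- Producers publish immediately, but stamp when the message may be seen.
    INSERT INTO messages (user_id, body, visible_at)
    VALUES (42, 'hello world', '2010-01-15 09:30:00');

    -- Consumers only ever see messages whose time has come.
    SELECT * FROM messages
     WHERE visible_at <= NOW()
     ORDER BY visible_at
     LIMIT 20;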

Shiraz Bhaiji
I like it ... interesting thought, I will definitely keep this one in mind as a solution ... my situation however is slightly different: we will be interacting with an API to actually publish the messages to a 3rd-party system.
farinspace
+1  A: 

Have a look at the Beanstalkd message queue. There are PHP clients for it. One of the nice features of Beanstalkd (as opposed to e.g. dropr) is that you can delay messages. That is, you can post a message to the queue and it will not be delivered to a client until X seconds have passed.

Beanstalkd does have one big downside though: It's an in-memory queue. That means if it (or your machine) crashes then the queue is empty and the contents lost. Persistence is a feature planned for the next version of beanstalkd.
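A rough sketch with the Pheanstalk PHP client (an assumption on my part; other clients exist and the exact constructor/namespace varies by version) - the delay is just an argument to put():

    <?php
    // Sketch using the Pheanstalk client (assumed; installed e.g. via Composer).
    require_once 'vendor/autoload.php';

    use Pheanstalk\Pheanstalk;

    $queue = new Pheanstalk('127.0.0.1'); // newer versions use Pheanstalk::create()

    // Producer: put a job on the 'messages' tube, hidden for 300 seconds.
    $queue->useTube('messages')->put(
        json_encode(array('user_id' => 42, 'body' => 'hello world')),
        1024, // priority
        300,  // delay in seconds before the job becomes visible to consumers
        60    // time-to-run before an unfinished job is released back
    );

    // Consumer: block until a job is ready, process it, then delete it.
    $job = $queue->watch('messages')->reserve();
    // ... post $job->getData() to the third-party API ...
    $queue->delete($job);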

Sander Marechal
Which it has now had for some time: -b <dir> - use a binlog to keep jobs on persistent storage in <dir>. Upon startup, beanstalkd will recover any binlog that is present in <dir>, then, during normal operation, append new jobs and changes in state to the binlog.
Kurt
Cool, thanks. I'll be upgrading all my beanstalkd instances then.
Sander Marechal