OK, here is the deal in brief: I spider the web (all kinds of data: blogs, news, forums) as it appears on the internet. Then I process this feed and run analysis on the processed data. Spidering is not a big deal; I can get the data pretty much in real time as the internet produces it. Processing is the bottleneck: it involves some computationally heavy algorithms.

I am trying to build a strategy for scheduling my spiders. The big goal is to make sure that the analysis produced as the end result reflects the effect of as much recent input as possible. When I start to think about it, the obvious objective is to make sure data does not pile up: I get data through the spiders, pass it on to the processing code, wait until processing is over, and then spider again, this time bringing in all the data that appeared while I was waiting for processing to finish. Okay, this is a very broad thought.

Can some of you share your thoughts, maybe think out loud? If you were me, what would go through your mind? I hope I am making sense with my question. This is not search engine indexing, by the way.

A:

It appears that you want to keep the processors from falling too far behind the spiders. I would imagine that you want to be able to scale this out as well.

My recommendation is that you implement a queue using a client/server SQL database. MySQL would work nicely for this purpose.


Design Objectives

  • Keep the spiders from getting too far ahead of the processors
  • Allow for a balance of power between spiders and processors (keeping each busy)
  • Keep data as fresh as possible
  • Scale out and up as needed


Queue: Create a queue to store the data from the spiders before it is processed. This could be done in several ways, but it does not sound like I/O is your bottleneck.

A simple approach would be an SQL table with this layout:

CREATE TABLE Queue (
    Queue_ID   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    CreateDate DATETIME NOT NULL,
    Status     ENUM('New', 'Processing') NOT NULL DEFAULT 'New',
    Data       BLOB NOT NULL
);
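
Not part of the layout above, but worth considering: both the spiders and the processors filter on Status constantly, so an index on that column keeps the polling queries cheap. A sketch:

CREATE INDEX Idx_Queue_Status ON Queue (Status);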

# pseudo code
function get_from_queue()
    # in SQL: atomically claim one pending row
    START TRANSACTION;
    SELECT Queue_ID, Data FROM Queue
        WHERE Status = 'New' ORDER BY CreateDate LIMIT 1 FOR UPDATE;
    UPDATE Queue SET Status = 'Processing'
        WHERE Queue_ID = (the Queue_ID from the SELECT above);
    COMMIT;
    # end sql

    return Data  # or false in the case of no records found
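
For concreteness, here is a minimal Python sketch of the same function. It assumes the mysql-connector-python package and the Queue table above; the connection parameters are hypothetical placeholders:

import mysql.connector

# Hypothetical connection parameters; substitute your own.
conn = mysql.connector.connect(
    host="localhost", user="crawler", password="secret", database="crawl"
)

def get_from_queue(conn):
    # Atomically claim one 'New' row; returns (queue_id, data) or None.
    cur = conn.cursor()
    try:
        conn.start_transaction()
        # Lock the oldest pending row so two processors cannot claim it.
        cur.execute(
            "SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' "
            "ORDER BY CreateDate LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.commit()
            return None
        queue_id, data = row
        cur.execute(
            "UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = %s",
            (queue_id,),
        )
        conn.commit()
        return queue_id, data
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()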

# pseudo code
function count_from_queue()
    # in SQL
    SELECT COUNT(*) FROM Queue WHERE Status = 'New'
    # end sql
    return (the count)
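
The matching count, again just a sketch under the same assumptions:

def count_from_queue(conn):
    # Number of rows still waiting for a processor.
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM Queue WHERE Status = 'New'")
    (count,) = cur.fetchone()
    cur.close()
    return count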


Spider:

So you have multiple spider processes. They each say:

if count_from_queue() < 10:
    # do the spider thing
    # save it in the queue
else:
    # sleep awhile

repeat

In this way, each spider will be either resting or spidering. The decision (in this case) is based on whether there are fewer than 10 pending items to process; you would tune that threshold to your purposes.
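
A minimal Python sketch of that loop, using the helpers above. crawl_next_batch() is a hypothetical stand-in for your actual spidering code:

import time

def spider_loop(conn, backlog_limit=10, nap_seconds=5):
    # Spider only while the backlog is short; otherwise rest.
    while True:
        if count_from_queue(conn) < backlog_limit:
            for page in crawl_next_batch():  # hypothetical crawl step
                cur = conn.cursor()
                # Status defaults to 'New' (see the table definition).
                cur.execute(
                    "INSERT INTO Queue (CreateDate, Data) VALUES (NOW(), %s)",
                    (page,),
                )
                conn.commit()
                cur.close()
        else:
            time.sleep(nap_seconds)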


Processor

So you have multiple processor processes. They each say:

Data = get_from_queue()
if Data:
    # process it
    # remove it from the queue
else:
    # sleep awhile

repeat

In this way, each processor will be either resting or processing.
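
And the processor side, again a sketch; process() is a hypothetical stand-in for your heavy analysis code:

import time

def processor_loop(conn, nap_seconds=5):
    # Process whenever work is available; otherwise rest.
    while True:
        claimed = get_from_queue(conn)
        if claimed:
            queue_id, data = claimed
            process(data)  # hypothetical analysis step
            # Done: remove the row from the queue.
            cur = conn.cursor()
            cur.execute("DELETE FROM Queue WHERE Queue_ID = %s", (queue_id,))
            conn.commit()
            cur.close()
        else:
            time.sleep(nap_seconds)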


In summary: whether you have this running on one computer or twenty, a queue will provide the control you need to ensure that all parts stay in sync and that none gets too far ahead of the others.

gahooa
Dear gahooa, that was neat, mate. Very neat. I understood what you mean. I have done something like this before. With another table I can implement auditing and control as well. And yes, I am going to run this on a cluster. Please add more if you think you have more to share. I appreciate your answer.
Dear Redfrog: thanks for the comment. As your design develops, feel free to email me for more feedback. Contact info at the bottom of this page: http://blog.gahooa.com/about/
gahooa