OK, here is the deal in brief: I spider the web (all kinds of data: blogs, news, forums) as it appears on the internet. Then I process this feed and run analysis on the processed data. Spidering is not a big deal; I can get the data pretty much in real time as the internet produces it. Processing is the bottleneck: it involves some computationally heavy algorithms.

I am trying to build a strategy for scheduling my spiders. The big goal is to make sure that the analysis produced as the end result reflects the effect of as much recent input as possible. When I start to think about it, the obvious objective is to make sure data does not pile up: I get data through the spiders, pass it on to the processing code, wait until processing is over, and then spider again, this time bringing in all the data that appeared while I was waiting for processing to finish. Okay, this is a very broad thought.

Can some of you share your thoughts, maybe think out loud? If you were me, what would go through your mind? I hope I am making sense with my question. This is not search engine indexing, by the way.

A:

It appears that you want to keep the processors from falling too far behind the spiders. I would imagine that you want to be able to scale this out as well.

My recommendation is that you implement a queue using a client/server SQL database. MySQL would work nicely for this purpose.


Design Objectives

  • Keep the spiders from getting too far ahead of the processors
  • Allow for a balance of power between spiders and processors (keeping each busy)
  • Keep data as fresh as possible
  • Scale out and up as needed


Queue: Create a queue to store the data from the spiders before it is processed. This could be done in several ways, but it does not sound like I/O is your bottleneck.

A simple approach would be an SQL table with this layout:

CREATE TABLE Queue (
    Queue_ID   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    CreateDate DATETIME NOT NULL,
    Status     ENUM('New', 'Processing') NOT NULL DEFAULT 'New',
    Data       BLOB NOT NULL
);
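
Not part of the layout above, but worth considering: both the spiders and the processors filter on Status constantly, so an index on that column keeps the polling queries cheap. A sketch:

CREATE INDEX Idx_Queue_Status ON Queue (Status);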

# pseudo code
function get_from_queue()
    # in SQL: atomically claim one pending row
    START TRANSACTION;
    SELECT Queue_ID, Data FROM Queue
        WHERE Status = 'New' ORDER BY CreateDate LIMIT 1 FOR UPDATE;
    UPDATE Queue SET Status = 'Processing'
        WHERE Queue_ID = (the Queue_ID from the SELECT above);
    COMMIT;
    # end sql

    return Data  # or false in the case of no records found
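
For concreteness, here is a minimal Python sketch of the same function. It assumes the mysql-connector-python package and the Queue table above; the connection parameters are hypothetical placeholders:

import mysql.connector

# Hypothetical connection parameters; substitute your own.
conn = mysql.connector.connect(
    host="localhost", user="crawler", password="secret", database="crawl"
)

def get_from_queue(conn):
    # Atomically claim one 'New' row; returns (queue_id, data) or None.
    cur = conn.cursor()
    try:
        conn.start_transaction()
        # Lock the oldest pending row so two processors cannot claim it.
        cur.execute(
            "SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' "
            "ORDER BY CreateDate LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.commit()
            return None
        queue_id, data = row
        cur.execute(
            "UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = %s",
            (queue_id,),
        )
        conn.commit()
        return queue_id, data
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()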

# pseudo code
function count_from_queue()
    # in SQL
    SELECT COUNT(*) FROM Queue WHERE Status = 'New'
    # end sql
    return (the count)
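
The matching count, again just a sketch under the same assumptions:

def count_from_queue(conn):
    # Number of rows still waiting for a processor.
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM Queue WHERE Status = 'New'")
    (count,) = cur.fetchone()
    cur.close()
    return count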


Spider:

So you have multiple spider processes. They each say:

if count_from_queue() < 10:
    # do the spider thing
    # save it in the queue
else:
    # sleep awhile

repeat

In this way, each spider will be either resting or spidering. The decision (in this case) is based on whether there are fewer than 10 pending items to process; you would tune that threshold to your purposes.
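
A minimal Python sketch of that loop, using the helpers above. crawl_next_batch() is a hypothetical stand-in for your actual spidering code:

import time

def spider_loop(conn, backlog_limit=10, nap_seconds=5):
    # Spider only while the backlog is short; otherwise rest.
    while True:
        if count_from_queue(conn) < backlog_limit:
            for page in crawl_next_batch():  # hypothetical crawl step
                cur = conn.cursor()
                # Status defaults to 'New' (see the table definition).
                cur.execute(
                    "INSERT INTO Queue (CreateDate, Data) VALUES (NOW(), %s)",
                    (page,),
                )
                conn.commit()
                cur.close()
        else:
            time.sleep(nap_seconds)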


Processor

So you have multiple processor processes. They each say:

Data = get_from_queue()
if Data:
    # process it
    # remove it from the queue
else:
    # sleep awhile

repeat

In this way, each processor will be either resting or processing.
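
And the processor side, again a sketch; process() is a hypothetical stand-in for your heavy analysis code:

import time

def processor_loop(conn, nap_seconds=5):
    # Process whenever work is available; otherwise rest.
    while True:
        claimed = get_from_queue(conn)
        if claimed:
            queue_id, data = claimed
            process(data)  # hypothetical analysis step
            # Done: remove the row from the queue.
            cur = conn.cursor()
            cur.execute("DELETE FROM Queue WHERE Queue_ID = %s", (queue_id,))
            conn.commit()
            cur.close()
        else:
            time.sleep(nap_seconds)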


In summary: whether you have this running on one computer or twenty, a queue will provide the control you need to ensure that all parts stay in sync and that none gets too far ahead of the others.

gahooa
Dear gahooa, that was neat, mate. Very neat. I understood what you mean. I have done something like this before. With another table I can implement auditing and control as well. And yes, I am going to run this on a cluster. Please add more if you think you have more to share. I appreciate your answer.
Dear Redfrog: thanks for the comment. As your design develops, feel free to email me for more feedback. Contact info at the bottom of this page: http://blog.gahooa.com/about/
gahooa