Hi.

I'm creating a distributed crawling Python app. It consists of a master server and associated client apps that will run on client servers. The purpose of each client app is to run across a targeted site and extract specific data. The clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site.

Each client app looks something like this:

main:
    parse the initial url
    call level1(data1)

function level1(data1):
    parse the url for data1
    use the required XPath to get the DOM elements
    call level2(data2) for each match

function level2(data2):
    parse the url for data2
    use the required XPath to get the DOM elements
    call level3(data3) for each match

function level3(data3):
    parse the url for data3
    use the required XPath to get the DOM elements
    call level4(data4) for each match

function level4(data4):
    parse the url for data4
    use the required XPath to get the DOM elements

    at the final function:
    -- all the data is output, and eventually returned to the server
    -- at this point the data has elements from each function

My question: given that the number of calls the current function makes to the child function varies, I'm trying to figure out the best approach.

Each function essentially fetches a page of content and then parses the page using a number of different XPath expressions, combined with different regexes depending on the site/page.
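
To give a rough idea, a single level function might look something like the sketch below, assuming requests and lxml are used for fetching and parsing; the XPath and regex are placeholders, since the real ones vary per site:

    import re
    import requests
    from lxml import html

    # hypothetical XPath/regex -- every site/page gets its own set
    LISTING_XPATH = '//div[@class="listing"]//a'
    LABEL_PATTERN = re.compile(r'widget', re.I)

    def level1(url):
        """Fetch one page, extract this level's data, and return the
        child URLs that the next level should be called with."""
        page = html.fromstring(requests.get(url, timeout=30).content)
        child_urls = []
        for node in page.xpath(LISTING_XPATH):
            if LABEL_PATTERN.search(node.text_content()):
                child_urls.append(node.get('href'))
        return child_urls  # the caller decides how/where level2 runs

The number of child calls simply falls out of how many nodes the XPath matched on that page.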

If I run a client on a single box as a sequential process, it'll take a while, but the load on the box is rather small. I've thought of attempting to implement the child functions as threads spawned from the current function, but that could be a nightmare, as well as quickly bring the "box" to its knees!

I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way that allows each client/function to be run directly from the master. This process requires a bit of a rewrite, but it has a number of advantages: a bunch of redundancy, and speed. It could detect if a section of the process was crashing and restart from that point. But I'm not sure if it would be any faster...
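
For instance, each "packet" could be nothing more than a small JSON blob naming the site, the level function to run, and its input; the field names here are hypothetical:

    import json

    # a hypothetical work packet the master could hand to any client box;
    # because it names the level and carries its input, a crashed step can
    # simply be re-queued and restarted from that point
    packet = {
        "site": "example.com",
        "level": "level3",
        "url": "http://example.com/item/42",
        "retries": 0,
    }
    print(json.dumps(packet))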

I'm writing the parsing scripts in Python.

So... any thoughts/comments would be appreciated.

I can get into a great deal more detail, but didn't want to bore anyone!!

Thanks!

tom

A: 

Take a look at the multiprocessing module. It allows you to set up a work queue and a pool of workers -- as you parse the page, you can spawn off tasks to be done by separate processes.
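
A rough sketch of that idea (parse_page is just a hypothetical stand-in for one of your level functions):

    from multiprocessing import Pool

    def parse_page(url):
        # hypothetical stand-in for one level's fetch-and-parse work
        return url, url.count('/')

    if __name__ == '__main__':
        urls = ['http://example.com/page/%d' % i for i in range(20)]
        with Pool(processes=4) as pool:
            # each URL becomes a task handled by one of the worker processes
            for url, depth in pool.imap_unordered(parse_page, urls):
                print(url, depth)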

brool
A: 

This sounds like a use case for MapReduce on Hadoop.

Hadoop Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. In your case, this would be a smaller cluster.

A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
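
With Hadoop Streaming, for example, a map task can be an ordinary Python script that reads URLs from stdin and writes tab-separated key/value pairs to stdout. A bare-bones sketch, with the extraction logic left as a placeholder:

    #!/usr/bin/env python
    # mapper.py -- one URL per input line, emits "url<TAB>result"
    import sys

    def extract(url):
        # placeholder for the real fetch + XPath/regex work
        return len(url)

    for line in sys.stdin:
        url = line.strip()
        if url:
            print('%s\t%s' % (url, extract(url)))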

You mentioned:

I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way that allows each client/function to be run directly from the master.

From what I understand, you want a main machine (box) to act as a master, and have client boxes that run other functions. For instance, you could run your main() function and parse the initial URLs on it. The nice thing is that you could parallelize your task for each of these URLs across different machines, since they appear to be independent of each other.

Since level4 depends on level3, which depends on level2, and so on, you can just pipe the output of each level into the next rather than calling one from inside the other.
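
In plain Python that piping can be expressed as chained generators, where each level consumes whatever the previous one yields. The level bodies below are placeholders:

    def level1(start_urls):
        for url in start_urls:
            # placeholder: parse the page and yield the level-2 inputs
            yield url + '/section'

    def level2(urls):
        for url in urls:
            yield url + '/item'

    def level3(urls):
        for url in urls:
            yield url + '/detail'

    # pipe each stage's output into the next instead of nesting the calls
    for result in level3(level2(level1(['http://example.com']))):
        print(result)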

For examples on how to do this, I would recommend checking out, in the given order, the following tutorials,

Hope this helps.

viksit
A: 

Check out the scrapy package. It will allow for easy creation of your "client apps" (a.k.a. crawlers, spiders, or scrapers) that go "deep" into a website.
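
A minimal spider sketch, just to show the shape (the name, start URL, and XPaths are hypothetical placeholders for your own):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/listing']

        def parse(self, response):
            # level 1: follow each listing link down to the detail pages
            for href in response.xpath('//div[@class="listing"]//a/@href').getall():
                yield response.follow(href, callback=self.parse_detail)

        def parse_detail(self, response):
            # level 2: extract the actual data from the "deep" page
            yield {
                'title': response.xpath('//h1/text()').get(),
                'url': response.url,
            }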

brool and viksit both have good suggestions for the distributed part of your project.

tgray