Hi, I need to build a site similar to indeed.com and many others: one that tracks a number of advertising sites and parses their HTML to list the ads on my own site.

I know that each source site needs its own strategy. That's no problem. My concern is that I want to scan the sites hourly, in batch fashion.

Is there a particularly suitable strategy for this? I've been told that Perl is a very strong batch scripting language. Is that so? How do I start?

Best,

+2  A: 

The good news is that you can do this in Perl. The bad news is that this is going to be complex, just as it would be in any language.

Start by reading Learning Perl.

Next you'll need to put together your spidering code.

Start with a simple single script that reads one page at a time.

There are many modules for getting web pages. Which to use depends on your needs. It gets even more complex if you need to scrape JavaScript-generated pages. Start with LWP::Simple or WWW::Mechanize. You can expand from there.
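As a starting point, fetching a single page with LWP::Simple might look roughly like the sketch below. The `fetch_page` helper and the idea of passing the URL on the command line are my own assumptions for illustration; only `get` comes from LWP::Simple.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch one page and return its HTML, or undef if the request failed.
# fetch_page is a hypothetical helper name, not part of LWP::Simple.
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);    # LWP::Simple's get returns undef on failure
    warn "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Take the URL to fetch from the command line, if one was given.
my $url = shift @ARGV;
if (defined $url) {
    my $html = fetch_page($url);
    printf "Fetched %d characters from %s\n", length($html), $url
        if defined $html;
}
```

LWP::Simple gives you no access to status codes or headers, so once you need cookies, forms, or finer error handling, WWW::Mechanize or LWP::UserAgent is probably the next step.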

There are also many modules for parsing HTML. HTML::TreeBuilder is a powerful module that has worked very well for me.
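To give a feel for HTML::TreeBuilder, here is a minimal sketch that pulls ad titles and links out of a page. The markup (a `<div class="ad">` wrapping an `<a>`) and the `extract_ads` helper are made up for the example; every real source site will need its own selectors.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return a list of { title, url } hashes, one per ad found in the HTML.
# The <div class="ad"> structure is a hypothetical example layout.
sub extract_ads {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @ads;
    for my $div ($tree->look_down(_tag => 'div', class => 'ad')) {
        # In scalar context look_down returns the first match (or undef).
        my $link = $div->look_down(_tag => 'a') or next;
        push @ads, {
            title => $link->as_trimmed_text,
            url   => $link->attr('href'),
        };
    }
    $tree->delete;    # release the parse tree explicitly
    return @ads;
}

my $sample = <<'HTML';
<html><body>
  <div class="ad"><a href="/jobs/1">Perl developer</a></div>
  <div class="ad"><a href="/jobs/2">Sysadmin</a></div>
  <div class="nav"><a href="/about">About</a></div>
</body></html>
HTML

printf "%s -> %s\n", $_->{title}, $_->{url} for extract_ads($sample);
```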

Once you can reliably download and parse a single page, you will need to add the spidering logic. You have to decide how you want to traverse the site: breadth first or depth first? Are you going to use a recursive algorithm, or an iterative one?
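One way to picture the traversal choice is a non-recursive, breadth-first crawl driven by an explicit queue, sketched below. The `crawl_breadth_first` function, the `$links_for` callback, and the in-memory `%site` map are all stand-ins I invented so the example runs without a network; in a real spider the callback would download and parse each page.

```perl
use strict;
use warnings;

# Visit pages breadth first, starting at $start, and return them in visit order.
# $links_for is a hypothetical callback: given a URL, it returns the URLs it links to.
sub crawl_breadth_first {
    my ($start, $links_for, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (@queue && @visited < $max_pages) {
        # shift takes the oldest entry (FIFO), which gives breadth-first order;
        # using pop instead would turn the queue into a stack (depth-first style).
        my $url = shift @queue;
        push @visited, $url;
        for my $next ($links_for->($url)) {
            next if $seen{$next}++;    # never queue the same URL twice
            push @queue, $next;
        }
    }
    return @visited;
}

# A tiny in-memory "site" standing in for real pages.
my %site = (
    '/'      => ['/jobs', '/about'],
    '/jobs'  => ['/jobs/1', '/jobs/2', '/'],
    '/about' => ['/'],
);
my @order = crawl_breadth_first('/', sub { @{ $site{ $_[0] } || [] } }, 100);
print join(' ', @order), "\n";    # prints: / /jobs /about /jobs/1 /jobs/2
```

The `%seen` hash is what keeps the crawl from looping forever on pages that link back to each other, and the page limit is a cheap safety net while you are still testing.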

If you need to scan many pages, you may need to create a controller to manage multiple spiders. You could use Coro, AnyEvent, POE, threads, or a fork-based strategy to manage your workers. What you choose will depend on your needs.
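Of those options, the fork-based one needs nothing beyond core Perl, so here is a rough sketch of a controller that runs one child process per source site with a cap on concurrency. The `run_workers` name, the site names, and the job callback are hypothetical; a real job would call your per-site spider.

```perl
use strict;
use warnings;

$| = 1;    # unbuffered output, so children do not re-print the parent's buffer

# Run $job->($site) in a child process for each site, at most $max_workers
# at a time. Returns how many children exited with status 0.
sub run_workers {
    my ($max_workers, $job, @sites) = @_;
    my ($running, $ok) = (0, 0);

    # Block until one child finishes and record whether it succeeded.
    my $reap_one = sub {
        my $pid = wait();
        if ($pid == -1) { $running = 0; return }    # no children left
        $running--;
        $ok++ if $? == 0;
    };

    for my $site (@sites) {
        $reap_one->() if $running >= $max_workers;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: run the job and report success or failure via exit status.
            my $success = eval { $job->($site); 1 };
            exit($success ? 0 : 1);
        }
        $running++;
    }
    $reap_one->() while $running > 0;
    return $ok;
}

# Hypothetical source sites; the job here only prints what it would scan.
my @sites = qw(site-a.example site-b.example site-c.example);
my $ok = run_workers(2, sub { print "[$$] scanning $_[0]\n" }, @sites);
print "$ok of ", scalar(@sites), " workers succeeded\n";
```

Separate processes keep one misbehaving site from taking down the whole batch, at the cost of having to pass results back through a database, files, or pipes.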

You can use the DBI module with the appropriate driver (e.g. DBD::mysql) to insert the data in your database.
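A minimal sketch of the DBI side follows. To keep it runnable without a database server I have swapped in DBD::SQLite with an in-memory database; that substitution, the `ads` table, and the `save_ads` helper are my assumptions. With MySQL the DSN would look something like `dbi:mysql:database=ads;host=localhost`, and the rest of the code should be largely the same.

```perl
use strict;
use warnings;
use DBI;

# Insert each ad (a { title, url } hash) using a prepared statement.
# Placeholders (?) let the driver handle quoting of scraped text.
sub save_ads {
    my ($dbh, @ads) = @_;
    my $sth = $dbh->prepare('INSERT INTO ads (title, url) VALUES (?, ?)');
    $sth->execute($_->{title}, $_->{url}) for @ads;
    return scalar @ads;
}

# Assumption: SQLite in memory, used here only so the sketch is self-contained.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema for scraped ads.
$dbh->do('CREATE TABLE ads (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    url   TEXT NOT NULL UNIQUE
)');

save_ads($dbh,
    { title => 'Perl developer', url => '/jobs/1' },
    { title => 'Sysadmin',       url => '/jobs/2' },
);

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM ads');
print "Stored $count ads\n";
```

Because scraped text is untrusted input, sticking to placeholders rather than interpolating values into the SQL string is worth the habit from the start.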

All you have to do now is generate your web app. There are many toolkits of various levels of complexity and power available. CGI::Application and Catalyst are two popular libraries. HTML::Mason and Squatting are some other options.

All of the modules I listed are available on CPAN. Used appropriately, CPAN will save you a lot of work. For many tasks the problem is too many choices, rather than a lack of them.

The book is, of course, available anywhere books are sold.

daotoad