ansaurus

Question

How can I make my Perl web crawler go faster?

Answer 1

+3 A:

Most of the time, your program is probably waiting for a response from the network. There's no away around most of that waiting time (other than putting your computer right next to the computer you want to talk to). Fork off a process to fetch each URL so you can download them simultaneously. You might consider things such as Parallel::ForkManager, POE, or AnyEvent.

brian d foy 2010-09-22 13:41:35

Answer 2

A:

See Brian's answer.

Run lots of copies of it. Use a shared storage system for keeping intermediate and final data.

It might be helpful to take more memory-intensive parts of the crawler (HTML parsing etc) and put those in a separate set of processes.

So have a pool of fetchers which read from the queue of pages to read, and put them into the shared storage area, and a pool of parser processes which read pages and write the results into the results database and queue new pages into the queue to read.

Or something. It really depends on the purpose of your crawler.

Ultimately if you're trying to crawl a lot of pages you'll probably need a lot of hardware and a very fat pipe (to your datacentre/ colo). So you'll need an architecture which allows the parts of the crawler to be split across many machines to scale properly.

MarkR 2010-09-24 23:14:00

ansaurus

tags:

views:

answers:

How can I make my Perl web crawler go faster?

related questions