views: 959
answers: 4
I am writing a crawler in Perl that has to extract the contents of web pages residing on the same server. I am currently using the HTML::Extract module to do the job, but I found it a bit slow, so I looked into its source code and found that it does not use any connection cache for LWP::UserAgent.

My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
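
For reference, the kind of connection reuse I am after is LWP::UserAgent's built-in connection cache; a minimal sketch of what I mean (the URLs are just placeholders):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # keep_alive => N attaches an LWP::ConnCache holding up to N persistent
    # connections, so repeated requests to the same server reuse one socket.
    my $ua = LWP::UserAgent->new( keep_alive => 10 );

    my @urls = map { "http://example.com/page$_.html" } 1 .. 3;
    for my $url (@urls) {
        my $res = $ua->get($url);
        print "$url: ", $res->code, "\n";
        # ... the body text would be extracted from $res->decoded_content ...
    }

As far as I can tell from HTML::Extract's source, it builds its LWP::UserAgent without anything like this, so every fetch opens a fresh connection.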

+3  A: 

I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.

Both of these, along with other modules, have appeared in answers on SO for similar questions to yours.

/I3az/

draegtun
Thanks for your answer. I am wondering, do you know which of the modules you mentioned performs better for repeatedly extracting text from a large number of HTML pages?
Alvin
With Web::Scraper, at least, you can pass it the contents of the page, rather than the URL. That way, you can perform your own caching prior to scraping.
Peter Kovacs
@Alvin: I don't know, because I have no idea how Web::Scraper, HTML::TreeBuilder or any other module performs against pQuery. All have their pros and cons.
draegtun
+1  A: 

HTML::Extract's features look very basic and uninteresting. If the modules that draegtun mentioned don't interest you, you could do everything that HTML::Extract does yourself using LWP::UserAgent and HTML::TreeBuilder, without requiring very much code at all, and then you would be free to work caching in on your own terms.
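
Something along these lines, for example (a rough sketch, untested; the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    # keep_alive gives the agent an LWP::ConnCache, so requests to the
    # same server reuse the connection instead of reconnecting each time.
    my $ua = LWP::UserAgent->new( keep_alive => 10 );

    sub body_text {
        my ($url) = @_;
        my $res = $ua->get($url);
        die "Failed to fetch $url: ", $res->status_line unless $res->is_success;

        my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );
        my $body = $tree->look_down( _tag => 'body' );
        my $text = $body ? $body->as_text : '';
        $tree->delete;    # release the parse tree explicitly
        return $text;
    }

    print body_text('http://example.com/page.html'), "\n";

That's roughly all HTML::Extract is doing for you, and here the caching policy is entirely in your hands.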

hobbs
A: 

I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri), it's also very easy to cache the results you need.
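
A rough sketch of what that can look like, assuming a plain in-memory hash as the cache and a placeholder URL:

    use strict;
    use warnings;
    use Web::Scraper;
    use LWP::UserAgent;
    use URI;

    my $ua = LWP::UserAgent->new( keep_alive => 10 );
    my %html_cache;    # simple in-memory cache; swap in whatever store you like

    # extract the text of the <body> element, with the tags stripped
    my $body_scraper = scraper {
        process 'body', text => 'TEXT';
    };

    sub scrape_body {
        my ($url) = @_;
        $html_cache{$url} //= $ua->get($url)->decoded_content;
        my $result = $body_scraper->scrape( $html_cache{$url}, URI->new($url) );
        return $result->{text};
    }

    print scrape_body('http://example.com/page.html'), "\n";

Because the scraper only ever sees the cached HTML string, you can fetch the pages however and whenever you like.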

singingfish
A: 

Do you need to do this in real-time? How does the inefficiency affect you? Are you doing the task serially so that you have to extract one page before you move onto the next one? Why do you want to avoid a cache?

Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.
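
For instance, a bare-bones sketch of a parallel fetcher with threads and Thread::Queue (the URL list and worker count are made up; plug your own extraction step into the loop):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    # example URLs; in a real crawler these come from your link discovery
    my @urls  = map { "http://example.com/page$_.html" } 1 .. 20;
    my $queue = Thread::Queue->new(@urls);

    my @workers = map {
        threads->create(sub {
            # one agent per thread, each with its own connection cache
            my $ua = LWP::UserAgent->new( keep_alive => 10 );
            while ( defined( my $url = $queue->dequeue_nb ) ) {
                my $res = $ua->get($url);
                next unless $res->is_success;
                printf "%s: %d bytes\n", $url, length $res->decoded_content;
            }
        });
    } 1 .. 4;

    $_->join for @workers;

The downloading and the extraction don't have to happen in the same process at all; the workers could just as well write the raw HTML somewhere for a separate extractor to chew on.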

brian d foy
Thanks, you're right, I can run the task in parallel. I solved the bottleneck using the threads module.
Alvin