tags:

views:

74

answers:

2

I am currently working on a project that involves crawling certain websites. However, sometimes my Perl program gets "stuck" on a website for reasons I can't figure out, and the program freezes for hours. To get around this I inserted some code to time out the subroutine that crawls the webpage. The problem with this is that, let's say I set the alarm to 60 seconds, most of the time the page will time out correctly, but occasionally the program will not time out and will just sit for hours on end (maybe forever, since I usually kill the program).

On the really bad websites the Perl program will just eat through my memory, taking 2.3 GB of RAM and 13 GB of swap. The CPU usage is also high and my computer becomes sluggish. Luckily, if it times out, all the resources get released quickly.

Is this a problem with my code or a Perl issue? What should I correct, and why is it causing this problem?

Thanks

Here is my code:

eval {

    local $SIG{ALRM} = sub { die("alarm\n") };

    alarm 60;
    &parsePageFunction();
    alarm 0;
};#eval

if($@) {

    if($@ eq "alarm\n") { print("Webpage Timed Out.\n\n"); }#if
    else { die($@."\n"); }#else
}#if
+1  A: 

You may want to elaborate on the crawling process.

I'm guessing it's a recursive crawl, where for each crawled page you crawl all the links on it, and then crawl all the links on those pages too.

If that's the case, you may want to do two things:

  1. Create some sort of depth limit: on each recursion, increment a counter and stop crawling once the limit is reached.

  2. Detect circular linking: if PAGE_A links to PAGE_B and PAGE_B links back to PAGE_A, you'll keep crawling until you run out of memory. A minimal sketch combining both ideas follows this list.
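
Here is a rough sketch of both ideas (illustrative only, not your code; crawl_page() and extract_links() are hypothetical placeholders for whatever fetching and parsing you already do):

use strict;
use warnings;

my %seen;            # URLs already visited (detects circular links)
my $MAX_DEPTH = 3;   # stop recursing past this depth

sub crawl {
    my ($url, $depth) = @_;

    return if $depth > $MAX_DEPTH;   # 1. depth limit reached
    return if $seen{$url}++;         # 2. already crawled this URL

    my $content = crawl_page($url);            # fetch the page (placeholder)
    for my $link (extract_links($content)) {   # parse out links (placeholder)
        crawl($link, $depth + 1);
    }
}

crawl('http://example.com/', 0);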

Other than that, you should look into the standard timeout facility of the module you're using; if that's LWP::UserAgent, you can pass LWP::UserAgent->new(timeout => 60).
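
For example, a sketch of using the built-in timeout (the URL is just a placeholder, and I'm passing the fetched content to your parsing routine purely for illustration):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 60);   # give up on the request after 60 seconds
my $response = $ua->get('http://example.com/');

if ($response->is_success) {
    parsePageFunction($response->decoded_content);   # your parsing routine
}
else {
    print "Request failed or timed out: ", $response->status_line, "\n";
}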

miedwar
I use the timeout for UserAgent, but that only applies to fetching the page, not to what happens afterwards. I think the problem is occurring after I get the page.
+4  A: 

Depending on where exactly in the code it is getting stuck, you might be running into an issue with Perl's safe signals. See the perlipc documentation for workarounds (e.g. Perl::Unsafe::Signals).
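
A rough sketch of how that workaround might wrap the code from your question (assuming the long-running operation is inside parsePageFunction):

use Perl::Unsafe::Signals;

eval {
    local $SIG{ALRM} = sub { die("alarm\n") };

    alarm 60;
    UNSAFE_SIGNALS {
        # Signals are delivered immediately inside this block, so the ALRM
        # can interrupt a single long-running opcode such as a regex match.
        &parsePageFunction();
    };
    alarm 0;
};

if($@) {
    if($@ eq "alarm\n") { print("Webpage Timed Out.\n\n"); }
    else { die($@."\n"); }
}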

runrig
I'm sorry, I should have been more clear. Maybe scraping would be a better term than crawling. Basically I extract the applicable content from each page and only follow URLs that lead to more applicable content, so I'm not going into many URLs, if any, and the depth is always 1. Could this actually be a regex issue, where the match never finishes and keeps requesting more memory? That doesn't seem likely to me, but I'm throwing it out there. Is there any way to exit a function based on how much memory the program is using?
@user387049 Yes, this could totally be a regex. Safe signals mean that alarm won't interrupt a discrete Perl operation, such as a regex. See http://rt.perl.org/rt3//Public/Bug/Display.html?id=73464
Schwern
Using Perl::Unsafe::Signals solved the problem. Some regexes were locking up and the alarm couldn't interrupt them. Thanks for the help!