I recently wanted to thread one of my Perl programs to speed it up. Given a list of websites, I wanted to start a thread for each URL, fetch the content of each site, and look for a company description on the page. Once one thread found a result, or all threads failed to, I wanted to exit, write my result, and read in the URLs for my next company.

The problem is that I use the Perl::Unsafe::Signals module inside the function that each thread runs. I need the unsafe signals to interrupt regular expressions that get "stuck". However, this seems to cause all sorts of problems, mainly the program crashing with the error message "Alarm Clock".

Therefore, is there a way to use Perl::Unsafe::Signals and threads safely? Or is there another way to time out a regular expression, such as sending a signal to the function (like the 'KILL' signal I send below)? Thanks.

Note: I stripped the code down to the pertinent parts; let me know if you need more.

use threads ('exit' => 'threads_only');
use threads::shared;
my @descrip;
share(@descrip);

my $lock;
share($lock);

URL:foreach my $url(@unique_urls) {
    #skip blank urls
    if(!$url) { next URL; }#if

    #find description
    my $thread = threads->create(\&findCompanyDescription, $PREV_COMPANY, $PREV_BASE_URL, $url);
}#foreach

#while a description has not been found and there are still active threads, keep looking
#there may be a better way to do this, but this seems to work for me
while(!@descrip && threads->list() != 0) {;}

#kill all threads, write output, read in next batch of urls
my @threads = threads->list();
foreach(@threads) { print("detaching\n"); $_->kill('KILL')->detach(); }#foreach

#######SUBROUTINE CALLED BY THREAD CREATE
sub findCompanyDescription {
    my($company_full, $base_url, $url) = @_;
    my($descrip, $raw_meta, $raw) = ('', '', '');
    my @company;

    $SIG{'KILL'} = sub { alarm(0); threads->exit(); };

    eval {
        local $SIG{ALRM} = sub { die("alarm\n") }; # NB: \n required
        alarm(5);

        use Perl::Unsafe::Signals;
        UNSAFE_SIGNALS {

            while($company) {
                my @matches = ($content =~ m!.*<([\w\d]+).*?>\s*about\s+$company[\w\s\-_]*<.*?>(?:<.*?>|\s)*(.*?)</\1.*?>!sig);

                MATCH:for(my $ndx=1; $ndx<@matches; $ndx+=2) {
                    ($raw, $descrip) = &filterResult($matches[$ndx], $company_full);

                    if($descrip) {
                        $company = undef;
                        last(MATCH);
                    }#if
                }#for

                #reduce the company name and try again
                $company = &reduceCompanyName($company);

            }#while

            alarm(0);
        };#unsafe_signals
    };#eval 

    if($@) {
        if($@ eq "alarm\n" && $DEBUG) { print("\nWebpage Timeout [].\n"); }#if
    }#if

    if($descrip) {
        lock($lock);
        @descrip = ($PREV_ID, $company_full, $base_url, $url, 1, $raw, $descrip);
    }#if
}#sub
+7  A: 

In general, "unsafe" signals are unsafe in both single-threaded and multi-threaded programs. You've only increased your peril by combining threads with unsafe signals. Perl's usual safe signal handler sets the flag signal_pending without meaningfully interrupting execution. The VM checks that flag between opcodes.

Your regexp execution is a single, "atomic" opcode. Of course, the regexp engine is itself another VM with its own opcodes, but the Perl signal handler currently has no visibility into it.

Frankly, I have no good idea how to interrupt the regexp engine. It has some global C state which, prior to perl-5.10, prevented it from being reentrant. It might not be safe for the kind of universal interruptibility you're after. If you really want it to be fully interruptible, you could fork and have a child process run the regexp and communicate the results back over a pipe:

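require JSON;
require IO::Select;

my $TIMEOUT_SECONDS = 2.5;

my ( $read, $write );
pipe( $read, $write ) or die "pipe failed: $!";

my @matches;
my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ( $pid ) {
    # Parent: close our copy of the write end, otherwise <$read> never sees EOF.
    close $write;

    my $select = IO::Select->new( $read );
    if ( $select->can_read( $TIMEOUT_SECONDS ) ) {
        local $/;
        my $json = <$read>;
        if ( $json ) {
            my $matches_ref = JSON::from_json( $json );
            if ( $matches_ref ) {
                @matches = @$matches_ref;
            }
        }
    }
    else {
        # Timed out: kill the child so waitpid does not block on the stuck regexp.
        kill 'KILL', $pid;
    }
    waitpid $pid, 0;
}
else {
    # Child: run the potentially slow regexp and send the captures back as JSON.
    close $read;
    my @r = $content =~ m!.*<([\w\d]+).*?>\s*about\s+$company[\w\s\-_]*<.*?>(?:<.*?>|\s)*(.*?)</\1.*?>!sig;
    my $json = JSON::to_json( \@r );
    print { $write } $json;
    close $write;
    exit;
}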
Josh Jore
If I *really* wanted an interruptible regexp engine, I might try a different engine via the pluggable interface described at http://perldoc.perl.org/perlreapi.html. Or I might embed a Perl regexp engine in the pluggable regexp engine with inter-process calls, so that the idea above would become "merely" a matter of declaring that the regexp should be evaluated in an inferior process and aborted at will.
Josh Jore
Do you have any advice on which engine to use?
+2  A: 

IMHO, mixing signals and threads is a challenging task per se (i.e. even without Perl-specific issues). Remember that even in a single-threaded program you can safely call only async-signal-safe functions from a signal handler, because the program may be interrupted at any point. Perl adds another layer of abstraction, so I have no idea how safe it is to call "die" from a signal handler when unsafe signals are in use.

If I remember correctly, SIGALRM is an asynchronous signal, so it must be handled synchronously. Your way of handling it is generally incorrect in multi-threaded programs.

Moreover, IMHO Perl threads just do not work the way most people expect. Avoid them and use processes instead.
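
As a rough illustration of the "use processes instead" advice, here is a minimal sketch that forks one worker per URL, takes the first result written to a shared pipe, and then kills the remaining workers. The names find_description and first_description and the example URLs are hypothetical stand-ins, not anything from the question's code; the sleep calls merely simulate a fast and a slow page.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical stand-in for "fetch the page and search it for a description".
sub find_description {
    my ($url) = @_;
    my $is_fast = $url =~ /fast/;
    sleep( $is_fast ? 0.1 : 5 );              # simulate a quick and a stuck page
    return $is_fast ? "description from $url" : '';
}

# Fork one worker per URL and return the first description any worker reports,
# or undef if every worker finishes without finding one.
sub first_description {
    my (@urls) = @_;
    pipe( my $read, my $write ) or die "pipe failed: $!";

    my @pids;
    for my $url (@urls) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {                    # child process
            close $read;
            my $found = find_description($url);
            print {$write} "$found\n" if $found ne '';
            close $write;
            exit 0;
        }
        push @pids, $pid;
    }

    close $write;            # parent keeps only the read end, so EOF can arrive
    my $first = <$read>;     # first reported line, or undef once all workers exit
    kill 'TERM', @pids;      # stop the workers that are still running
    waitpid $_, 0 for @pids; # reap every child
    chomp $first if defined $first;
    return $first;
}

my $result = first_description( 'http://slow.example/', 'http://fast.example/about' );
print defined $result ? "$result\n" : "no description found\n";
```

Because each worker is a separate process, a stuck regexp in one of them can be ended with an ordinary signal without touching the interpreter state of the parent.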

P.S.

The following line doesn't make sense:

$SIG{'KILL'} = sub { alarm(0); threads->exit(); };

SIGKILL (as well as SIGSTOP) cannot be caught.
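
To make the point about real OS signals concrete, here is a small sketch using two processes: the child installs a $SIG{KILL} handler, the parent sends SIGKILL, and the child's wait status shows it was terminated by the signal rather than by its handler. This covers only signals delivered by the operating system; the threads module documents its own kill method as a Perl-level emulation rather than an OS signal, which this sketch does not exercise. The pipe is there only so the parent waits until the child has set its handler.

```perl
use strict;
use warnings;
use POSIX ();

pipe( my $ready_r, my $ready_w ) or die "pipe failed: $!";
my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    close $ready_r;
    # Perl accepts the assignment, but the OS will not install a SIGKILL handler.
    $SIG{KILL} = sub { print "caught KILL\n"; exit 0 };
    print {$ready_w} "ready\n";
    close $ready_w;
    sleep 30;
    exit 0;
}

close $ready_w;
my $line = <$ready_r>;       # block until the child has set its handler
kill 'KILL', $pid;
waitpid $pid, 0;

my $signal = $? & 127;       # low 7 bits of the wait status hold the signal number
print "child was terminated by signal $signal\n";
```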

Dmitry V. Krivenok
I got that (not the alarm part; I was trying other things out) from the Perl docs at http://perldoc.perl.org/threads.html
A: 

I'm not really a specialist in Perl multithreading, but one thing you are apparently missing is that signals are global to the whole process; they are not thread-specific. On POSIX systems you can't set a signal handler for a single thread: signals are delivered to the whole process. In other words, an alarm() call affects the whole process, not only the thread that calls it. And even a local %SIG in a multithreaded context doesn't do what one might think it does, because local is a matter of syntax (scoping), not of threads.
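
A single-threaded sketch is enough to show that alarm() manages one timer per process: a second call replaces the pending timer and returns the seconds that were left on it. By the same mechanism, several threads that each call alarm(5) would keep overwriting one shared timer. The variable name $remaining is just for illustration.

```perl
use strict;
use warnings;

alarm(10);                   # arm the process-wide timer for 10 seconds
my $remaining = alarm(5);    # replaces it; returns the seconds left on the old timer
alarm(0);                    # cancel the timer so this script is not killed

print "the first timer had about $remaining seconds left when it was replaced\n";
```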

Dummy00001