I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.

My basic search is using this:

if ($text =~ /$search/i ) {
    print "${title}:${text}\n";
}

($title contains the title of the item the search term was found in) This is too much though, since sometimes $text will be holding hundreds of lines of text.

This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.

I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.

What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?

Thanks in advance!

Brian

A: 

You could try the following:

if ($text =~ /(.*)$search(.*)/i ) {

  my @before_words = split ' ', $1;
  my @after_words = split ' ',$2;

  my $before_str = get_last_x_words_from_array(@before_words);
  my $after_str = get_first_x_words_from_array(@after_words); 

  print $before_str . ' ' . $search . ' ' . $after_str;

}

Some code obviously omitted, but this should give you an idea of the approach.
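For illustration, the two omitted helpers might look something like the sketch below. The bodies and the $CONTEXT_WORDS setting are assumptions, since the answer only gives the function names and never says what "x" is. Note also that without the /s modifier, .* in the regex above will not cross newlines, so the match is limited to the line containing the search term.

```perl
use strict;
use warnings;

# Assumed value for "x"; the answer does not specify how many words to keep.
my $CONTEXT_WORDS = 4;

# Hypothetical body: return the last $CONTEXT_WORDS words, joined by spaces.
sub get_last_x_words_from_array {
    my @words = @_;
    my $start = @words > $CONTEXT_WORDS ? @words - $CONTEXT_WORDS : 0;
    return join ' ', @words[ $start .. $#words ];
}

# Hypothetical body: return the first $CONTEXT_WORDS words, joined by spaces.
sub get_first_x_words_from_array {
    my @words = @_;
    my $end = @words > $CONTEXT_WORDS ? $CONTEXT_WORDS - 1 : $#words;
    return join ' ', @words[ 0 .. $end ];
}
```

Both helpers return an empty string for an empty list, so a search term at the very beginning or very end of $text needs no special handling.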

As far as extracting the title ... I think this approach does not lend itself to that very well.

jonstjohn
+2  A: 

Your initial attempt at 4 words before/after wasn't too far off.

Try:

if ($text =~ /((\S+\s+){0,4})($search)((\s+\S+){0,4})/i) {
    my ($pre, $match, $post) = ($1, $3, $4);
    ...
}
denkfaul
Okay, that works perfectly now, but takes a *very* long time. Using the same data, mine (which doesn't return correct results :) ) runs in less than 1 second. I changed the code to your snippet, and it ran for more than 15 seconds... Any guesses on how to improve performance?
BrianH
if ($text =~ /((\S+\s+){0,4})($search)((\S+\s+){0,4})/ ) { print "$1$3$4\n"; } This produces the right output, and it flies. Thanks so much for your help!
BrianH
I basically removed the ?: - not sure why that reduces performance to have them in, though...
BrianH
Oooh - sorry - it wasn't the ?: - somehow I removed the /i from the end. My search was running fast because it was running case sensitive. When I add the /i back on the end, the performance slows *way* down. Your original solution works perfectly!
BrianH
So now I need to figure out how to perform this matching case-insensitive, and still be fast...
BrianH
It looks like it works with or without the ?:; it just creates another capture variable if you leave it out. I'll leave this as is unless someone can pipe in and explain which is better in this case :)
denkfaul
Sorry to confuse - my 4th comment explains it was actually yours doing a case-INsensitive match (which I want) that was causing the slowness. If I only search for the term without words around it, case insensitive matches go very fast.
BrianH
+3  A: 

You can use $` and $' to get the string before and after the match, then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $':

$foo =~ /(match)/;

my $match = $1;
#my $before = $`;
#my $after = $';
my $before = substr($foo, 0, $-[0]);
my $after =  substr($foo, $+[0]);

$after =~ s/((?:(?:\w+)(?:\W+)){4}).*/$1/s;  # /s lets .* cross newlines, so multi-line text is truncated too
$before = reverse $before;                   # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){4}).*/$1/s;
$before = reverse $before;

print "$before -> $match <- $after\n";
daotoad
Hmm - this actually performs great, even when I turn on case insensitive matching...
BrianH
The reverse trick for grabbing from the back of a string came from a post on Perlmonks called sexeger - http://www.perlmonks.org/index.pl?node_id=33410
daotoad
Using the special variables $` and $' incurs a performance penalty for ALL regexes used anywhere in the program. See shlomif's answers for a better way.
+2  A: 

I would suggest using the special arrays @- and @+ (see perldoc perlvar) to find where in the string the match starts and ends, and from that how long it is.

Shlomi Fish
+1. That's the best answer, imho. It does not do any unnecessary matching around the real 'match' and does not incur the performance penalty of using $` and $'.
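The @- / @+ suggestion could be put together roughly as follows. This is only a sketch: the function name context_snippet, its parameters, and the default of 4 words are assumptions made for illustration, not something from the answers above. It matches only the search term itself (case-insensitively), reads the match offsets from $-[0] and $+[0], and then trims the surrounding text with split rather than a larger regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $words words of context on each side of
# the first case-insensitive match of $search in $text, or undef if no match.
sub context_snippet {
    my ($text, $search, $words) = @_;
    $words = 4 unless defined $words;

    # \Q...\E treats the search term as literal text, not as a pattern.
    return unless $text =~ /\Q$search\E/i;

    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ($start, $end) = ($-[0], $+[0]);
    my $match  = substr $text, $start, $end - $start;
    my @before = split ' ', substr($text, 0, $start);
    my @after  = split ' ', substr($text, $end);

    # Keep only the words nearest the match; shorter lists are kept whole.
    splice @before, 0, @before - $words if @before > $words;
    splice @after, $words               if @after  > $words;

    return join ' ', @before, $match, @after;
}
```

Something like print "${title}: ", context_snippet($text, $search), "\n"; would then replace the original print, after checking that the result is defined. Because the pieces are rejoined with single spaces, a term that matches in the middle of a word (say "fox" inside "foxes") comes back with the word split apart, and original line breaks are flattened.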