Ok say I have a paragraph of text:

After getting cut for the first and last time in his life, Durant watched from the sofa as the American team waltzed into the gold-medal game and then was tested by Spain, ultimately emerging with a 118-107 victory that ended an eight-year gold-medal drought for the senior U.S. men's national team. But the gold-medal drought for the Americans in the FIBA World Championship remains intact, now at 16 years and counting as Team USA prepares to head to Turkey without any of the members of the so-called Redeem Team from Beijing.

What I would like to do is run PHP's preg_match_all with a few keywords (for example 'team' and 'for') on the text, and then retrieve a snippet (maybe 10 words before and 10 words after) for each match found.

Does anyone have any idea how that can be done?

A: 

Check this http://www.php.net/manual/en/regexp.reference.squarebrackets.php

So this is one word with a separator:

([:word:].*[:punct:])

These are ten words with separators:

([:word:].*[:punct:]){10}

Something like this would be close to your solution:

([:word:].*[:punct:].){10}team([:punct:].[:word:].*){10}
Alex
600 characters are not enough to describe all the problems with this answer. Please, just delete it.
Alan Moore
A: 

You might find a lot of interesting ideas in the Drupal search excerpt builder.

http://api.drupal.org/api/function/search_excerpt/6

This one is UTF8-safe and has all kinds of edge-cases covered.

berkes
+2  A: 

You could do this:

  • Get a list of all words and their offsets using preg_match_all with the PREG_OFFSET_CAPTURE flag.
  • Iterate the words and find the search term.
  • Get the x words before and after the match.

Here’s an example:

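// Collect every word together with its byte offset in $str.
preg_match_all('/[\w-]+/u', $str, $matches, PREG_OFFSET_CAPTURE);
$term = 'team';
$span = 3; // number of words to keep on each side of the match
for ($i=0, $n=count($matches[0]); $i<$n; ++$i) {
    $match = $matches[0][$i];
    if (strcasecmp($term, $match[0]) === 0) {
        // Offset of the first word of the snippet.
        $start = $matches[0][max(0, $i-$span)][1];
        // Offset just past the last word, so a match near the end of the text is not cut short.
        $last = $matches[0][min($n-1, $i+$span)];
        $end = $last[1] + strlen($last[0]);
        echo ' … '.substr($str, $start, $end-$start).' … ';
    }
}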
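
To handle several keywords at once, as the question asks ('team' and 'for'), the same word-offset idea could be wrapped in a function. This is only a sketch: the name keyword_snippets and its parameters are made up for illustration, it ends each snippet at the end of its last word, and it lower-cases with strtolower, so non-ASCII keywords would need mb_strtolower instead.

```php
<?php
// Hypothetical helper (not part of the answer): returns one snippet per
// occurrence of any keyword, with up to $span words of context on each side.
function keyword_snippets(string $str, array $terms, int $span = 10): array
{
    // Every word with its byte offset in $str; bail out if nothing matched
    // or the input is not valid UTF-8.
    if (!preg_match_all('/[\w-]+/u', $str, $matches, PREG_OFFSET_CAPTURE)) {
        return [];
    }
    $words = $matches[0];
    $n = count($words);
    // Lower-case the keywords once for a case-insensitive comparison.
    $terms = array_map('strtolower', $terms);
    $snippets = [];
    foreach ($words as $i => $word) {
        if (!in_array(strtolower($word[0]), $terms, true)) {
            continue;
        }
        $first = $words[max(0, $i - $span)];
        $last  = $words[min($n - 1, $i + $span)];
        $start = $first[1];
        $end   = $last[1] + strlen($last[0]); // byte offset just past the last word
        $snippets[] = substr($str, $start, $end - $start);
    }
    return $snippets;
}

$text = 'Durant watched from the sofa as the American team waltzed into the gold-medal game';
print_r(keyword_snippets($text, ['team', 'for'], 2));
```

Snippets for keywords that sit close together will overlap; merging them is left out to keep the sketch short.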
Gumbo
A: 

Something like this will do the trick, bearing in mind that the two keywords must be no more than about 4 words apart or the pattern will not match. You can change the limits in the quantifiers to adjust this, which lets you control how closely related the keywords have to be.

preg_match_all("~([\w]+[\s\- ,]+){0,3}watched([\s\- ,]+[\w]+){0,4}\ssofa([\s\- ,]+[\w]+){0,3}~i", $text, $matches);
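
As a rough illustration of how such a pattern could be built for two arbitrary keywords, here is a sketch. The function name proximity_pattern and the $maxGap parameter are assumptions made for this example; the context limits (up to 3 words before and after) and the separator characters follow the pattern above.

```php
<?php
// Hypothetical helper: builds a pattern that matches $first followed by
// $second with at most $maxGap words in between, plus up to 3 words of
// context on each side.
function proximity_pattern(string $first, string $second, int $maxGap = 4): string
{
    $sep = '[\s\-,]+';               // whitespace, hyphens and commas between words
    $a = preg_quote($first, '~');    // escape the keywords for use inside ~...~
    $b = preg_quote($second, '~');
    return '~(?:\w+' . $sep . '){0,3}' . $a
         . '(?:' . $sep . '\w+){0,' . $maxGap . '}\s' . $b
         . '(?:' . $sep . '\w+){0,3}~i';
}

$text = 'Durant watched from the sofa as the American team waltzed into the gold-medal game.';
if (preg_match(proximity_pattern('watched', 'sofa'), $text, $m)) {
    echo $m[0], "\n";
}
```

Lowering $maxGap makes the match stricter: with a value of 1, the sample sentence no longer matches because two words separate 'watched' and 'sofa'.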
budinov.com