views:

103

answers:

5

I'm trying to find an optimized regex to return the N words (if available) around another one to build a summary. The string is in UTF-8, so the definition of "words" is larger than just [a-z]. The string that serves as the reference word could be in the middle of a word or not directly surrounded by spaces.

I've already got the following that works but seems actually greedy and chokes when looking for more than 6-7 words around another one:

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I've build to do that but I'd need help getting the regex to be less greedy and work for any number of words around.

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    $find = preg_quote($find);
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

If I had the following $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

And called getWordsAround($string, 'vitae', 8, 8) I'd want to get the following result:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit,"

Thank you for your help regex gurus.

+1  A: 

This worked fine here:

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

Gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit,

The performance of this regular expression, however, is absolute crap. I really don't know how to make this more efficient, short of doing it without regular expressions.

The reason for the performance being "absolute crap" for words near the end is that engine tries to start a match on every character and then advances several dozens of characters until it finds out that, in the end, it cannot find the string you're looking for and discards everything.

Artefacto
Bad example from my part, sorry for that. Try it with the word vitae. I don't know why but when it's further away in the string, it seems to become very very slow.
lpfavreau
@Ipf Yes, that's why I said it's absolute crap. See my edit.
Artefacto
Ah, didn't see the edit. I know I could do it without regex but I'd still want to see if someone has an idea so I can learn from it. +1 for the plain word explanation on why the performance is absolute crap. :-)
lpfavreau
+2  A: 

What about using a regex or some other method to split the input text into an array of words. Then run through the words with a loop looking for the target word. Once it's found, then grab the required array slice, join it together and print.

To maintain the original whitespace between words, you can include it at the end of each word.

Also, this could be implemented as a stream parser rather than splitting the whole string first.

ar
I like the idea on paper, but when you get to implement you'll run into roadblocks (eg: how should you join the pieces back together while maintaining their original separators)?
NullUserException
@NullUserException, you could include the whitespace with the word token or implement a stream parser that saves the last N word boundaries as it goes through the string.
ar
If he's not using regular expressions, he might as well loop through the string until he finds the word he wants and then go backwards and forward to find the surrounding words. It will be faster and certainly more memory efficient.
Artefacto
He doesn't even have to to backwards. He can save the position of the last word before in a circular array with size `n` (where `n` is the number of words he wants to keep before the match) as he traverses the string.
Artefacto
+1  A: 

The problem with using this regex is that it causes the regex engine to catastrophically backtrack. The number of attempts increases exponentially with the size of the string, and that is no good. You might want to look into atomic grouping to improve performance.

Alternatively you could find the first occurrence of the given word and start looking backwards and forwards for words up to the desired length. Pseudo-ish code:

$pos = strpos($find);
$result = $find;

foreach $word before $pos {
    $result = $word . $result;
    $count++
    if ($count >= $target)
        break;
}

foreach $word after $pos {
    $result .= $word;
    $count++
    if ($count >= $target)
        break;
}

Of course finding the words before and after, and handling partial strings can get really messy.

NullUserException
You should use a circular array like I said in the comment to ar's answer. It's inefficient to traverse a UTF-8 string backwards and very efficient to do it forward.
Artefacto
Thanks for the link on atomic grouping. I'll look into it.
lpfavreau
+1  A: 

As mentioned earlier, the problem is a very large amount of backtracking. To solve this, I tried to use lookbehind and lookahead to anchor the match to the string. So I came up with:

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, as variable length lookbehinds are not supported in PCRE (or perl for that matter). So we are left with:

/consectetur\s*(?:\S+\s+){0,8}/

Which only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag, get the offset of $match[0], take the substring up to that point, reverse the string with strrev, get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/), reverse the match, and recombine:

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

You can make it a bit faster on very big strings by taking a safe subset of characters prior to the match, like 100. Then you are only reversing a 100 character string.

All that being said, a solution which doesn't use regular expressions may work better.

wuputah
Edited to add actual PHP code. Seems to work great on the test string.
wuputah
I think there I've read somewhere there is a problem with PREG_OFFSET_CAPTURE because it returns the byte offset instead of the actual number of characters and strrev is not multibyte compatible. This would work great on a latin-1 string but not UTF-8 I'm afraid. And reversing UTF-8 in PHP is not efficient, at least the functions I've tried.
lpfavreau
You actually want the byte offset for `substr`, not the character offset. As far reversing strings in UTF-8, the efficiency of such code could be made quite negligible if you establish a reasonable `substr` length to capture, e.g. `($before * 20)` bytes. Any encoding issues would be at the beginning of the string, which should be cut off when you match `$before` words.
wuputah
@lpfavreau There's no reason reversing in PHP shouldn't be efficient.
Artefacto
@Artefacto You need to write a C function for [mb_strrev](http://stackoverflow.com/questions/434250/how-to-reverse-a-unicode-string). :)
wuputah
@wuputah What I might do is to write an efficient PHP function for that. The accepted answer is inefficient.
Artefacto
@Artefacto There is still no accepted answer although there are lots of nice potential solutions. Still looking for an efficient PHP-only solution (although efficient will be pale compared to your C function), ideally in regex to learn from it. :-)
lpfavreau
@Ipfavreau Artefacto was referring to the accepted answer on the mb_strrev question. As for this question, I doubt you'll see any other answers posted.
wuputah
+2  A: 

Here is an internal PHP function that does what you want. It's unlikely you'll be able to beat this performance-wise in a user-land function.

There should be no problem using this for UTF-8 functions, since '\r', '\n' and ' ' (and in general all the ASCII characters) cannot appear as part of another character sequence. So if you pass valid UTF-8 data to both parameters you should be fine. Reversing UTF-8 data as you would normally reverse single character encodings (with strrev) would indeed mean trouble, but this function doesn't do that.

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beggining of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

From my tests, the C function is 4 times faster than wuputah's version (and doesn't have the problem of strrev).

Artefacto
Wow, this is impressive. +1 for probably finding the fastest way to resolve this problem. I didn't have the time to test it, in fact, I've never compiled my own PHP function and I'm not sure it'll be convenient for its distribution, but nonetheless, it doesn't remove anything to how you resolved that problem. I'm still looking for a PHP-only solution but this should get points anyway! Thanks!
lpfavreau
By the way, when declaring is_sep, you check twice for '\n' so I guess you could remove one check there.
lpfavreau
@Ipfavreau OK, I removed the extra `\n`. Thanks.
Artefacto
Wow dude, +1, that is hardcore.
wuputah