ansaurus

Question

Optimized regex for N words around a given word (UTF-8)

Answer 1

+1 A:

This worked fine here:

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

Gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit,

The performance of this regular expression, however, is absolute crap. I really don't know how to make this more efficient, short of doing it without regular expressions.

The reason for the performance being "absolute crap" for words near the end is that engine tries to start a match on every character and then advances several dozens of characters until it finds out that, in the end, it cannot find the string you're looking for and discards everything.

Artefacto 2010-08-27 15:57:07

Bad example from my part, sorry for that. Try it with the word vitae. I don't know why but when it's further away in the string, it seems to become very very slow.

lpfavreau 2010-08-27 16:35:35

@Ipf Yes, that's why I said it's absolute crap. See my edit.

Artefacto 2010-08-27 16:39:39

Ah, didn't see the edit. I know I could do it without regex but I'd still want to see if someone has an idea so I can learn from it. +1 for the plain word explanation on why the performance is absolute crap. :-)

lpfavreau 2010-08-27 16:43:23

Answer 2

+2 A:

What about using a regex or some other method to split the input text into an array of words. Then run through the words with a loop looking for the target word. Once it's found, then grab the required array slice, join it together and print.

To maintain the original whitespace between words, you can include it at the end of each word.

Also, this could be implemented as a stream parser rather than splitting the whole string first.

ar 2010-08-27 16:16:30

I like the idea on paper, but when you get to implement you'll run into roadblocks (eg: how should you join the pieces back together while maintaining their original separators)?

NullUserException 2010-08-27 16:26:58

@NullUserException, you could include the whitespace with the word token or implement a stream parser that saves the last N word boundaries as it goes through the string.

ar 2010-08-27 16:29:27

If he's not using regular expressions, he might as well loop through the string until he finds the word he wants and then go backwards and forward to find the surrounding words. It will be faster and certainly more memory efficient.

Artefacto 2010-08-27 16:29:34

He doesn't even have to to backwards. He can save the position of the last word before in a circular array with size `n` (where `n` is the number of words he wants to keep before the match) as he traverses the string.

Artefacto 2010-08-27 16:36:25

Answer 3

+1 A:

The problem with using this regex is that it causes the regex engine to catastrophically backtrack. The number of attempts increases exponentially with the size of the string, and that is no good. You might want to look into atomic grouping to improve performance.

Alternatively you could find the first occurrence of the given word and start looking backwards and forwards for words up to the desired length. Pseudo-ish code:

$pos = strpos($find);
$result = $find;

foreach $word before $pos {
    $result = $word . $result;
    $count++
    if ($count >= $target)
        break;
}

foreach $word after $pos {
    $result .= $word;
    $count++
    if ($count >= $target)
        break;
}

Of course finding the words before and after, and handling partial strings can get really messy.

NullUserException 2010-08-27 16:20:10

You should use a circular array like I said in the comment to ar's answer. It's inefficient to traverse a UTF-8 string backwards and very efficient to do it forward.

Artefacto 2010-08-27 16:59:09

Thanks for the link on atomic grouping. I'll look into it.

lpfavreau 2010-08-27 17:39:27

Answer 4

+1 A:

As mentioned earlier, the problem is a very large amount of backtracking. To solve this, I tried to use lookbehind and lookahead to anchor the match to the string. So I came up with:

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, as variable length lookbehinds are not supported in PCRE (or perl for that matter). So we are left with:

/consectetur\s*(?:\S+\s+){0,8}/

Which only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag, get the offset of $match[0], take the substring up to that point, reverse the string with strrev, get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/), reverse the match, and recombine:

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

You can make it a bit faster on very big strings by taking a safe subset of characters prior to the match, like 100. Then you are only reversing a 100 character string.

All that being said, a solution which doesn't use regular expressions may work better.

wuputah 2010-08-27 16:50:01

Edited to add actual PHP code. Seems to work great on the test string.

wuputah 2010-08-27 17:12:15

I think there I've read somewhere there is a problem with PREG_OFFSET_CAPTURE because it returns the byte offset instead of the actual number of characters and strrev is not multibyte compatible. This would work great on a latin-1 string but not UTF-8 I'm afraid. And reversing UTF-8 in PHP is not efficient, at least the functions I've tried.

lpfavreau 2010-08-27 17:36:53

You actually want the byte offset for `substr`, not the character offset. As far reversing strings in UTF-8, the efficiency of such code could be made quite negligible if you establish a reasonable `substr` length to capture, e.g. `($before * 20)` bytes. Any encoding issues would be at the beginning of the string, which should be cut off when you match `$before` words.

wuputah 2010-08-27 17:47:26

@lpfavreau There's no reason reversing in PHP shouldn't be efficient.

Artefacto 2010-08-28 02:04:41

@Artefacto You need to write a C function for [mb_strrev](http://stackoverflow.com/questions/434250/how-to-reverse-a-unicode-string). :)

wuputah 2010-08-29 03:17:30

@wuputah What I might do is to write an efficient PHP function for that. The accepted answer is inefficient.

Artefacto 2010-08-29 03:23:25

@Artefacto There is still no accepted answer although there are lots of nice potential solutions. Still looking for an efficient PHP-only solution (although efficient will be pale compared to your C function), ideally in regex to learn from it. :-)

lpfavreau 2010-08-29 16:53:14

@Ipfavreau Artefacto was referring to the accepted answer on the mb_strrev question. As for this question, I doubt you'll see any other answers posted.

wuputah 2010-08-29 18:17:58

Answer 5

+2 A:

Here is an internal PHP function that does what you want. It's unlikely you'll be able to beat this performance-wise in a user-land function.

There should be no problem using this for UTF-8 functions, since '\r', '\n' and ' ' (and in general all the ASCII characters) cannot appear as part of another character sequence. So if you pass valid UTF-8 data to both parameters you should be fine. Reversing UTF-8 data as you would normally reverse single character encodings (with strrev) would indeed mean trouble, but this function doesn't do that.

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beggining of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

From my tests, the C function is 4 times faster than wuputah's version (and doesn't have the problem of strrev).

Artefacto 2010-08-28 03:21:21

Wow, this is impressive. +1 for probably finding the fastest way to resolve this problem. I didn't have the time to test it, in fact, I've never compiled my own PHP function and I'm not sure it'll be convenient for its distribution, but nonetheless, it doesn't remove anything to how you resolved that problem. I'm still looking for a PHP-only solution but this should get points anyway! Thanks!

lpfavreau 2010-08-28 13:42:09

By the way, when declaring is_sep, you check twice for '\n' so I guess you could remove one check there.

lpfavreau 2010-08-28 13:43:56

@Ipfavreau OK, I removed the extra `\n`. Thanks.

Artefacto 2010-08-28 13:55:43

Wow dude, +1, that is hardcore.

wuputah 2010-08-29 00:26:36

ansaurus

tags:

views:

answers:

Optimized regex for N words around a given word (UTF-8)

related questions