views:

169

answers:

4

Why does the following segfault, and how can I prevent it?

<?php

$str = ' <fieldset> <label for="go-to">Go to: </label>  ' 
       . str_repeat(' ', 10000) 
       . '<input type="submit" value="Go" /> </fieldset> </form>';

preg_match_all("@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+
@ix", $str, $matches);

?>

I believe it's causing a .... wait for it .... stack overflow.

EDIT:

The above is a simplified version the pattern that demonstrates the problem. A more complete version:

@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* # continue, allow text content and some inline tags

# normal sentence ending
[\?\!\.]+ # valid ending characters -- note elipses allowed
(?<!\b[ap]m\.)(?<!\b[ap]\.m\.)(?<!digg this\!)(?<!Stumble This\!) # disallow some  false positives that we don't care about
\s*
(?:&apos;|&\#0*34;|'|&lsquo;)?\s* # closing single quotes, in the unusual case like "he said: 'go away'".
(?:"|&quot;|&\#0*34;|&\#x0*22;|&rdquo;|&\#0*8221;|&\#x0*201D;|''|``|\xe2\x80\x9d|&\#0*148;|&\#x0*94;|\x94|\))?\s* # followed by any kind of close-quote char
(?=\<) # should be followed by a tag.
@ix

The purpose is to find html blocks that appear to end at what looks like a valid English sentence ending. I have found that this method is very good at telling the difference between 'content' text (like an article body) and 'layout' text (i.e., like navigational elements). Sometimes if there's a vast amount of white space in between tags it blows up, however.

A: 

Does this still do what you want?

</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:(?>[^<\?\!\.]*)|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+
Jeremy Stein
Close, but not quite: this will disallow any [?!.] before the [\?\!\.]+ at the end, but I want to allow them
ʞɔıu
Would you mind providing a test case where this version matches differently from yours?
Jeremy Stein
A: 

Your regular expression causes massive amounts of backtracking. With 10000 characters in the middle, it would get pretty messy and slow. Still, I would not expect it to crash...!

rikh
Why wouldn't it crash? How can backtracking be handled other than with a stack? What happens when the stack burns through all available memory?
Jeremy Stein
i think pcre's stack handling is a fixed size and relatively small actually IIRC.
ʞɔıu
I still would not expect it to crash. fail when it runs out of space, yes, but crash the program? no.
rikh
Recursion is handled by stack and that's why it overflows. Stack overflow IS out of space (handled by operating system, however with possible security flaws).
doc
if you do something similar in python, it won't segfault. it will hang, mind you, but it won't segfault. I think pcre attempts its own stack management or something like that
ʞɔıu
+1  A: 

The first thing I would try is making all the quantifiers possessive and all the groups atomic:

"@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>[^<]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
[?!.]+
@ix"

I think Jeremy's right: it's not backtracking per se that's killing you, it's all the state info the regex engine has to save to make backtracking possible. The regex seems to be constructed in such a way that if it ever has to backtrack, it's going to fail anyway. So use possessive quantifiers and atomic groups and don't bother saving all that useless info.

EDIT: to allow for the sentence-ending punctuation, you could add another alternative to the second line:

(?>[^<?!.]++|(?![^?!.\s<]++<)[?!.]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+

The addition matches one or more of said characters, unless they're the last non-whitespace characters in the element.

Alan Moore
I didn't test it, but I think you've hit the nail on the head here (although it's usually safe to agree with you on matters of regex...).
Bart Kiers
I believe PHP's regex (which is some version of PCRE) only supports atomic groupings, not possessive quantifiers. I don't believe the pattern you suggest would work because the [^<]++ would clobber the [?!.]+ at the end and not allow backtracking.
ʞɔıu
Also, sadly I don't think PHP's regex allows variable length lookbehinds, otherwise I could just do `(?>(?:[^<]|.....)*)(?<=[?!.]+...)`
ʞɔıu
PHP *does* support possessive quantifiers.
Bart Kiers
You have a point about the `[^<]++` clobbering up everything before `[?!.]+`. Although that might not be an issue if the input is always properly formed: ie. there's always an `>` before `[?!.]+`.
Bart Kiers
I'm not in a position to test this right now, but you can experiment with leaving some of the quantifiers greedy and such. There's a lot of overlap in the effects of the atomic groups and possessive quantifiers. But why do you need the `[?!.]+` anyway?
Alan Moore
Never mind, I just read your edit.
Alan Moore
A: 

I'm fairly sure that even newer versions of PHP are bundled with PCRE 7.0 which has known segment fault issues. I don't think that there are any intentions on correcting the issue as it is technically a PCRE issue, not an issue with PHP.

If you tell us what you are attempting to accomplish your best bet would be to try to write an alternate expression.

The bug in question is: http://bugs.php.net/bug.php?id=40909

evolve