ansaurus

Question

Answer 1

A:

Does this still do what you want?

</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:(?>[^<\?\!\.]*)|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+

Jeremy Stein 2009-11-12 14:44:03

Close, but not quite: this disallows any [?!.] before the [\?\!\.]+ at the end, but I want to allow them.

ʞɔıu 2009-11-12 16:23:09

Would you mind providing a test case where this version matches differently from yours?

Jeremy Stein 2009-11-12 18:31:44

Answer 2

A:

Your regular expression causes massive amounts of backtracking. With 10000 characters in the middle, it would get pretty messy and slow. Still, I would not expect it to crash...!

rikh 2009-11-12 14:52:33

Why wouldn't it crash? How can backtracking be handled other than with a stack? What happens when the stack burns through all available memory?

Jeremy Stein 2009-11-12 14:59:41

I think PCRE's stack is a fixed size and relatively small, IIRC.

ʞɔıu 2009-11-12 16:03:27

I still would not expect it to crash. Fail when it runs out of space, yes, but crash the program? No.

rikh 2009-11-12 16:20:09

Recursion is handled by the stack, and that's why it overflows. A stack overflow IS running out of space (it is handled by the operating system, though with possible security flaws).

doc 2009-11-12 16:27:02

If you do something similar in Python, it won't segfault. It will hang, mind you, but it won't segfault. I think PCRE attempts its own stack management or something like that.

ʞɔıu 2009-11-12 17:01:54
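
The difference the comments above draw between "fail when it runs out of space" and "crash the program" can be sketched with a toy example. This is written in Python rather than PHP, and it is not how PCRE is implemented: `match_run` and `guarded_match` are hypothetical names, and the matcher only handles the pattern `x*`. It recurses once per matched character, so a long input exhausts the allowed depth, and an explicit recursion cap turns that into a reportable failure. PHP's `pcre.recursion_limit` setting serves a broadly similar purpose.

```python
import sys


def match_run(text, pos=0):
    """Toy matcher for `x*` that recurses once per matched character."""
    if pos < len(text) and text[pos] == "x":
        return match_run(text, pos + 1)
    return pos  # index just past the run of x's


def guarded_match(text, limit=1000):
    """Run match_run under a recursion cap; return None if the cap is hit."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return match_run(text)
    except RecursionError:
        # Out of (allowed) stack: report failure instead of crashing.
        return None
    finally:
        sys.setrecursionlimit(previous)


short_result = guarded_match("xxx.")
long_result = guarded_match("x" * 10000 + ".")
```

With the cap in place, the short input matches normally and the long one comes back as a failure the caller can handle.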

Answer 3

+1 A:

The first thing I would try is making all the quantifiers possessive and all the groups atomic:

"@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>[^<]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
[?!.]+
@ix"

I think Jeremy's right: it's not backtracking per se that's killing you, it's all the state info the regex engine has to save to make backtracking possible. The regex seems to be constructed in such a way that if it ever has to backtrack, it's going to fail anyway. So use possessive quantifiers and atomic groups and don't bother saving all that useless info.

EDIT: to allow for the sentence-ending punctuation, you could add another alternative to the second line:

(?>[^<?!.]++|(?![?!.\s]++<)[?!.]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+

The addition matches one or more of said characters, unless they're the last non-whitespace characters in the element.

Alan Moore 2009-11-12 16:06:54

I didn't test it, but I think you've hit the nail on the head here (although it's usually safe to agree with you on matters of regex...).

Bart Kiers 2009-11-12 16:23:49

I believe PHP's regex (which is some version of PCRE) only supports atomic groupings, not possessive quantifiers. I don't believe the pattern you suggest would work because the [^<]++ would clobber the [?!.]+ at the end and not allow backtracking.

ʞɔıu 2009-11-12 16:25:23

Also, sadly, I don't think PHP's regex allows variable-length lookbehinds; otherwise I could just do `(?>(?:[^<]|.....)*)(?<=[?!.]+...)`

ʞɔıu 2009-11-12 16:28:02

PHP *does* support possessive quantifiers.

Bart Kiers 2009-11-12 16:32:26

You have a point about the `[^<]++` clobbering everything before `[?!.]+`. Although that might not be an issue if the input is always properly formed, i.e. there's always a `>` before `[?!.]+`.

Bart Kiers 2009-11-12 16:36:14

I'm not in a position to test this right now, but you can experiment with leaving some of the quantifiers greedy and such. There's a lot of overlap in the effects of the atomic groups and possessive quantifiers. But why do you need the `[?!.]+` anyway?

Alan Moore 2009-11-12 16:56:26

Never mind, I just read your edit.

Alan Moore 2009-11-12 16:59:52
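
The clobbering concern raised in this thread can be reproduced in a few lines. The sketch below uses Python rather than PHP, so it runs on a different engine from PCRE. It assumes Python 3.11+ for native possessive quantifiers and falls back to the lookahead-plus-backreference emulation on older versions. `possessive_run` is a hypothetical helper, and the patterns are cut-down stand-ins for the ones in the answer, not the full expression.

```python
import re
import sys


def possessive_run(char_class):
    """Pattern fragment: one or more of char_class, never giving characters back."""
    if sys.version_info >= (3, 11):
        return char_class + "++"
    # Emulation for older Pythons: a lookahead captures the whole run and a
    # backreference consumes it; the engine cannot backtrack into the lookahead.
    return "(?=(?P<run>" + char_class + "+))(?P=run)"


# Greedy: [^<]+ first takes the final "." too, then gives it back.
greedy = re.compile(r"[^<]+[?!.]+")
# Possessive with the broad class: the final "." is consumed and never returned.
clobbering = re.compile(possessive_run(r"[^<]") + r"[?!.]+")
# Possessive with punctuation excluded from the class: the "." is left for the end.
fixed = re.compile(possessive_run(r"[^<?!.]") + r"[?!.]+")

sample = "Hello world."

greedy_match = greedy.fullmatch(sample)
clobbering_match = clobbering.fullmatch(sample)
fixed_match = fixed.fullmatch(sample)
```

Only the middle pattern fails on the sample, which is why the edited answer takes the punctuation characters out of the possessive character class and handles them in a separate alternative.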

Answer 4

A:

I'm fairly sure that even newer versions of PHP are bundled with PCRE 7.0, which has known segmentation fault issues. I don't think there is any intention of correcting this, as it is technically a PCRE issue, not an issue with PHP.

Your best bet would be to write an alternative expression; if you tell us what you are attempting to accomplish, we can help with that.

The bug in question is: http://bugs.php.net/bug.php?id=40909

evolve 2009-11-12 16:20:50


Need to prevent PHP regex segfault
