Hi,

I am using the regular expression below to weed out any non-Latin characters. I found that if the subject string is longer than 342 characters, the function fails, everything aborts, and the website connection is reset.

I narrowed it down to the \p{P} Unicode character property, which matches any punctuation character.

Does anyone know/see where the problem lies, exactly?

preg_match('/^([\p{P}\p{S}&\p{Latin}0-9]|\s)*$/u', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');

+1  A: 

If you're "weeding out" non-Latin characters, why not just do this:

preg_replace('/[^\p{Latin}]+/u', '', $s)

EDIT: Okay, so you're trying to validate the input. I was going to say, use this:

preg_match('/^[\p{Latin}]+$/u', $s)

...but it turns out that only matches Latin letters. I was thinking of Java's undocumented shorthand, \p{L1}, which matches everything in the Latin1 (ISO-8859-1) character set, but in PHP you have to spell it out:

preg_match('/^[\x00-\xFF]+$/u', $s)
Alan Moore
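To make the difference between the two validation patterns above concrete, here is a minimal sketch. The function names are hypothetical and only wrap the `preg_match` calls quoted in the answer; the sample strings are illustrative.

```php
<?php
// Hypothetical helpers; only preg_match and the two patterns come from the thread.

// \p{Latin} is a script property: it matches Latin-script letters only,
// so spaces, digits and punctuation make the match fail.
function isLatinLettersOnly(string $s): bool
{
    return preg_match('/^[\p{Latin}]+$/u', $s) === 1;
}

// With the /u modifier, \x00-\xFF is a range of code points U+0000..U+00FF,
// i.e. everything in Latin-1, including spaces, digits and punctuation.
function isLatin1Only(string $s): bool
{
    return preg_match('/^[\x00-\xFF]+$/u', $s) === 1;
}
```

With these, `isLatinLettersOnly('abc 123')` is false because of the space and digits, while `isLatin1Only('abc 123')` is true. Both reject a Cyrillic string.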
@Alan, thank you. However, I would like to notify the user of the error, and I need the validation to fail in order for an error to occur. Thus the validation rule (the regular expression) needs to describe what 'correct' looks like.
KcYxA
Oh yeah, I meant to suggest that. I knew it was gratuitously inefficient to put the `\s` in its own alternative and wrap the whole thing in a capturing group, but I wouldn't have expected it to go pear-shaped on such a small input.
Alan Moore
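The rewrite hinted at in the last comment can be sketched as follows: the question's pattern with `\s` moved into the character class and the capturing group dropped. The function name is hypothetical. The explanation for the crash is an assumption, not something confirmed in the thread: older PCRE builds could recurse once per repetition of a quantified group and exhaust the stack, whereas a quantified character class is matched in a simple loop.

```php
<?php
// Hypothetical helper; the pattern is the question's regex, flattened into
// a single character class with no alternation and no capturing group.
function isAllowedText(string $s): bool
{
    $result = preg_match('/^[\p{P}\p{S}&\p{Latin}0-9\s]*$/u', $s);

    if ($result === false) {
        // preg_match returns false on an engine error (invalid UTF-8, limits hit, ...)
        throw new RuntimeException('Regex error code: ' . preg_last_error());
    }

    // 1 means every character was allowed; 0 means validation failed,
    // which is the case where the user should be notified.
    return $result === 1;
}
```

This keeps the "describe what correct looks like" approach from the question, so a return value of false can be used to show the error message. A call such as `isAllowedText(str_repeat('a', 1000))` stays well past the 342-character input from the question.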