ansaurus

Question

Are the PHP preg_functions multibyte safe?

Answer 1

+1 A:

No, they are not. See the question preg_match and UTF-8 in PHP for example.

Gumbo 2009-11-19 21:03:28

Answer 2

+1 A:

No, you need to use the multibyte string functions, such as mb_ereg.

Ben S 2009-11-19 21:03:47

They're the multi-byte version of the POSIX `ereg` functions, though, which aren't exactly the same as the PCRE `preg` functions.

mercator 2009-11-19 21:28:15

Answer 3

+5 A:

PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:

The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

PHP currently uses PCRE 7.9; your system might have an older version.

Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.
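An alternative to reading the build configuration is to test at runtime with a pattern that needs both features. This is only a minimal sketch: the helper name pcre_has_utf8 is made up for illustration, and it assumes that a failed pattern compile makes preg_match() return false.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the PCRE build
// behind preg_* accepts the 'u' modifier and Unicode properties (\pL).
// If the pattern cannot be compiled, preg_match() returns false and emits
// a warning, which the @ operator suppresses here.
function pcre_has_utf8(): bool
{
    // "\xC3\xA4" is the UTF-8 encoding of "ä", a single letter.
    return @preg_match('/^\pL$/u', "\xC3\xA4") === 1;
}

echo 'PCRE version: ', PCRE_VERSION, "\n";
echo 'UTF-8 and Unicode properties: ', pcre_has_utf8() ? 'yes' : 'no', "\n";
```

PCRE_VERSION is the constant PHP exposes for the PCRE library version it is running with.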

outis 2009-11-19 21:06:46

I'm using PHP 5.3.0, which includes PCRE version 7.9. I checked the PCRE config.h file and it includes the UTF8 definition, so it looks like the preg_ functions are safe. Thanks very much for the info!

Spoonface 2009-11-19 21:50:23

Answer 4

+3 A:

PCRE supports UTF-8 out of the box; see the documentation for the 'u' modifier.

Illustration ("\xC3\xA4" is the UTF-8 encoding of the German letter "ä"):

  echo preg_replace('~\w~', '@', "a\xC3\xA4b");

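This echoes "@@¤@" because "\xC3" and "\xA4" were treated as distinct symbols (without 'u', which bytes above 0x7F count as \w depends on the locale, so the exact output can vary).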

  echo preg_replace('~\w~u', '@', "a\xC3\xA4b");

With the 'u' modifier, this prints "@@@" because "\xC3\xA4" was treated as a single letter.
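The same byte-versus-character difference can be seen without relying on \w by counting what "." matches. This is a small sketch for illustration, not part of the original answer:

```php
<?php
$s = "a\xC3\xA4b"; // "aäb" in UTF-8: 3 characters, 4 bytes

// Without 'u', '.' consumes one byte per match.
$bytes = preg_match_all('/./s', $s);

// With 'u', '.' consumes one UTF-8 encoded character per match.
$chars = preg_match_all('/./su', $s);

echo "$bytes $chars\n"; // prints "4 3"
```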

stereofrog 2009-11-19 21:41:07

Really? Hmm, I'm not overly proficient with regex strings. If you don't mind, I might post some of my preg_ code to see what you think.

Spoonface 2009-11-19 22:08:55

Answer 5

A:

Some of my more complicated preg functions:

(1a) validate username as alphanumeric + underscore:

preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$username)

(1b) possible UTF alternative:

preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u',$username)

(2a) validate email:

preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix",$email)

(2b) possible UTF alternative:

preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ixu",$email)

(3a) normalize newlines:

preg_replace("/(\n){2,}/","\n\n",$str);

(3b) possible UTF alternative:

preg_replace("/(\n){2,}/u","\n\n",$str);

Do these changes look alright?

Spoonface 2009-11-19 22:21:50

The 'u' modifier only affects how regexp "metasymbols", like "." or "\w", are matched. Ranges like [A-Z] and literals like "\n" remain the same, no matter whether you use 'u' or not.
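A short sketch of what that means for pattern (1a)/(1b). The \p{L}/\p{N} variant is an assumption about what one might use if non-ASCII letters should be accepted; it is not something proposed in this thread. The last two lines show one further effect of 'u' that matters in practice: the subject must be valid UTF-8, or the call fails.

```php
<?php
$ascii  = 'user_name1';
$umlaut = "J\xC3\xBCrgen"; // "Jürgen" in UTF-8

// Pattern (1a) from the post, without the trailing modifier.
$byRange = '/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/';

// ASCII ranges behave the same with and without 'u'.
$r1 = preg_match($byRange, $ascii);         // 1
$r2 = preg_match($byRange . 'u', $ascii);   // 1
$r3 = preg_match($byRange, $umlaut);        // 0
$r4 = preg_match($byRange . 'u', $umlaut);  // 0

// Assumed alternative: Unicode letter/number properties together with 'u'.
$byProperty = '/^\p{L}[\p{L}\p{N}]*(?:_[\p{L}\p{N}]+)*$/u';
$r5 = preg_match($byProperty, $umlaut);     // 1

// With 'u', an invalid UTF-8 subject makes preg_match() return false.
$r6 = preg_match('/a/u', "a\xFF");          // false
$badUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR); // true

var_dump($r1, $r2, $r3, $r4, $r5, $r6, $badUtf8);
```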

stereofrog 2009-11-19 22:33:55

alright then, cheers for the info

Spoonface 2009-11-19 23:53:20
