There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_functions are all mb safe? Couldn't find any mention in the php documentation.
No, they are not. See the question preg_match and UTF-8 in PHP for example.
PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:
The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables corre- spond to Unicode release 5.1.
PHP currently uses PCRE 7.9; your system might have an older version.
Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.
pcre supports utf8 out of the box, see documentation for the 'u' modifier.
Illustration (\xC3\xA4 is the utf8 encoding for the german letter "ä")
echo preg_replace('~\w~', '@', "a\xC3\xA4b");
this echoes "@@¤@" because "\xC3" and "\xA4" were treated as distinct symbols
echo preg_replace('~\w~u', '@', "a\xC3\xA4b");
(note the 'u') prints "@@@" because "\xC3\xA4" were treated as a single letter.
Some of my more complicated preg functions:
(1a) validate username as alphanumeric + underscore:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$username)
(1b) possible UTF alternative:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u',$username)
(2a) validate email:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix",$email))
(2b) possible UTF alternative:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ixu",$email))
(3a) normalize newlines:
preg_replace("/(\n){2,}/","\n\n",$str);
(3b) possible UTF alternative:
preg_replace("/(\n){2,}/u","\n\n",$str);
Do thse changes look alright?