ansaurus

Question

preg_replace to strip out non-printing characters seems to remove all foreign characters as well

Answer 1

+2 A:

Part of the problem is that you aren't treating the target as a UTF-8 string; you need the /u modifier for that. Also, in UTF-8 every non-ASCII character is encoded as two or more bytes, all of them in the range \x80..\xFF, so a byte-oriented pattern that strips that range removes every non-ASCII character as well. Try this:

preg_replace('/\p{Cc}+/u', '', $value)

\p{Cc} is the Unicode property for control characters, and the u causes both the regex and the target string to be treated as UTF-8.
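
A minimal sketch of how this behaves, assuming a made-up sample string (the input and the variable names $clean and $keepWhitespace are illustrative, not from the question). Note that tab, line feed and carriage return are control characters too, so the second pattern shows one possible way to keep them:

```php
<?php
// Hypothetical sample input: Polish text with an embedded NUL, a BEL and a tab.
$value = "Zażółć\x00 gęślą\x07\tjaźń";

// \p{Cc} covers the C0 and C1 control characters plus DEL,
// so the tab is removed along with NUL and BEL.
$clean = preg_replace('/\p{Cc}+/u', '', $value);

// Variant that keeps tab, line feed and carriage return: the negated class
// matches only characters that are Cc and not one of the three listed.
$keepWhitespace = preg_replace('/[^\P{Cc}\t\n\r]+/u', '', $value);

var_dump($clean);
var_dump($keepWhitespace);
```

In both results the Polish letters survive, because with /u they are matched as whole characters rather than as individual bytes.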

Alan Moore 2010-07-20 23:26:07

Will it leave valid characters outside of the ASCII range, like the Polish diacritics (ąęćśńżź)? I'm looking for a regular expression that will strip invalid UTF-8 sequences (so MySQL won't complain while inserting such a string into the database), but leave everything else untouched.

pako 2010-10-28 10:19:10

I think for that you would want to use `'/\P{Any}/u'` - `Any` should be self-explanatory, and `\P{}` (uppercase) is the negated form of `\p{}`. But I'd be more concerned with how those invalid byte sequences got in there in the first place.

Alan Moore 2010-10-28 13:13:20
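
A caveat worth checking before relying on a /u pattern for this: with /u, PCRE validates the subject first, and on invalid UTF-8 preg_replace() returns null instead of matching anything, so the bad bytes have to be removed by other means. The sketch below shows that behaviour and one possible repair. It assumes the mbstring extension is available, and the sample string is made up:

```php
<?php
// Hypothetical input: \xFF can never occur in valid UTF-8.
$value = "ąę\xFFćś";

// With /u the invalid subject is rejected outright: the call returns null
// and preg_last_error() reports PREG_BAD_UTF8_ERROR.
$result = preg_replace('/\p{Cc}+/u', '', $value);
var_dump($result === null);
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);

// One possible repair (assumes mbstring): tell mbstring to emit nothing
// for bytes it cannot decode, then round-trip the string through UTF-8.
$previous = mb_substitute_character();
mb_substitute_character('none');
$valid = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
mb_substitute_character($previous); // restore the earlier setting

var_dump(mb_check_encoding($valid, 'UTF-8'));
```

mb_check_encoding() on its own is enough if the goal is only to detect bad input and reject it, which also helps track down where the invalid bytes come from.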

Answer 2

+2 A:

You can use Unicode character properties:

preg_replace('/[^\p{L}\s]/u','',$value);

(Do add the other classes you want to let through)
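
As a rough illustration of why the extra classes matter, the sketch below compares the letters-and-whitespace pattern with a wider whitelist. The sample string and the choice of \p{N} (numbers), \p{P} (punctuation) and \p{S} (symbols) are assumptions about what one might want to keep:

```php
<?php
// Hypothetical input with letters, digits, punctuation and one control byte.
$value = "Łódź, 2010: 100% ąę\x01!";

// Letters and whitespace only: digits and punctuation are stripped as well.
$lettersOnly = preg_replace('/[^\p{L}\s]/u', '', $value);

// Wider whitelist: letters, numbers, punctuation, symbols and whitespace.
// Only the control character \x01 falls outside it.
$wider = preg_replace('/[^\p{L}\p{N}\p{P}\p{S}\s]+/u', '', $value);

var_dump($lettersOnly);
var_dump($wider);
```

Keep in mind that \s lets tab and newline through, and that a whitelist drops anything not listed, including characters such as combining marks (\p{M}).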

If you want to convert Unicode to ASCII, this is by no means foolproof, but it gives some nice transliterations:

echo iconv('utf-8','ascii//translit','éñó'); // may print 'eno'; the result depends on the iconv implementation and the current locale

Wrikken 2010-07-20 23:29:36
