ansaurus

Question

regex to filter all but whitelisted characters from a multi-language string

Answer 1

+1 A:

It is a common problem that Russian letters are not recognised by the \w pattern, so you can add them to the whitelist explicitly:

$allowed = "-+?!,.;:\w\sа-я";
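For context, here is a minimal sketch of how that whitelist might be plugged into preg_replace(). The helper name filter_whitelisted is hypothetical, and the sketch assumes the source file is saved as UTF-8 and that the pattern is compiled with the /u modifier, so that "а-я" is read as a range of code points rather than of bytes.

```php
<?php
// Hypothetical helper name; the whitelist itself is the one from the answer.
function filter_whitelisted(string $input): ?string
{
    // Allowed: the listed punctuation, word characters, whitespace and the
    // Russian letters а..я. The hyphen comes first so it is taken literally.
    $allowed = "-+?!,.;:\w\sа-я";

    // /u tells PCRE to treat both pattern and subject as UTF-8.
    // preg_replace() returns null if the pattern or subject is invalid.
    return preg_replace('/[^' . $allowed . ']/u', '', $input);
}

echo filter_whitelisted("привет, <b>world</b>!"), "\n"; // привет, bworldb!
```

The characters `<`, `>` and `/` are not in the whitelist, so they are dropped while the Cyrillic and Latin letters stay.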

dchekmarev 2010-03-18 20:37:21

Won't [а-я] be expanded to its UTF-8 code point numbers, so something like [50-60000] (numbers not accurate)? That would be slow for a regex, I think?

Coronatus 2010-03-18 20:39:59

Can you use a character class inside a character class? This would translate to [^-+?!,.;:\w\s[а-я]]

jeroen 2010-03-18 20:46:59

Jeroen, I can, but you're right: there is no "subclass", so it adds the "[" and "]" symbols. Answer fixed :)

dchekmarev 2010-03-18 20:58:06

Hmmmm, that does not work. If I use that I get a warning: "Message: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 5".

jeroen 2010-03-18 21:17:20

I think you saved the file not in UTF-8 but in something like the cp1251 encoding.

dchekmarev 2010-03-18 21:23:13

Thanks for your help. It's just a form that is being submitted; how can I change its encoding?

jeroen 2010-03-18 22:05:32

You can change the file encoding with iconv from a Unix shell (I assume it is standard Windows-1251):

iconv -f cp1251 -t utf-8 < file.php > new_file.php; mv new_file.php file.php
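The iconv command converts a file on disk. For data arriving from a submitted form, a similar conversion could be done at runtime. The sketch below is only an illustration: the helper name to_utf8 is hypothetical, it assumes the mbstring extension is available, and it carries over the guess that any non-UTF-8 input is Windows-1251.

```php
<?php
// Hypothetical helper: returns $raw as UTF-8, ASSUMING that any input
// which is not already valid UTF-8 is encoded as Windows-1251.
function to_utf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');
}

// "\xEF\xF0\xE8\xE2\xE5\xF2" is the word "привет" encoded as Windows-1251.
echo to_utf8("\xEF\xF0\xE8\xE2\xE5\xF2"), "\n"; // привет
```

Input that already passes the UTF-8 check is returned unchanged, so the helper can be applied to every field before it reaches the regex.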

dchekmarev 2010-03-18 22:18:12

I cannot get it to work, but I think I'll try it on another server; perhaps it's a local problem.

jeroen 2010-03-19 13:46:08

Answer 2

+3 A:

You can use [\p{L}\p{N}] (originally posted as \p{LN}, which is wrong) instead of \w; see http://www.php.net/manual/en/regexp.reference.unicode.php
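As a rough illustration of this suggestion, the properties can be placed directly inside the negated whitelist class. The punctuation list here is an assumption borrowed from the first answer, and the /u modifier is required for \p{...} to work on UTF-8 input.

```php
<?php
// Whitelist: some punctuation (assumed, taken from the first answer),
// any Unicode letter (\p{L}), any Unicode number (\p{N}) and whitespace.
$pattern = '/[^-+?!,.;:\p{L}\p{N}\s]/u';

// Everything outside the whitelist is removed, whatever the script.
echo preg_replace($pattern, '', "Привет, 世界 <42>!"), "\n"; // Привет, 世界 42!
```

Compared with listing ranges such as а-я, this covers letters of every script without naming each alphabet.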

stereofrog 2010-03-18 21:10:57

Thanks. I keep getting "invalid UTF-8 string at offset 5". Is it possible that PCRE was not compiled with Unicode properties, and can I check that? At the moment I'm just testing on an Ubuntu 9.10 desktop server.
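One way to check this from PHP itself is to try a tiny pattern and see whether it compiles. The sketch below is hedged: the function name is hypothetical, and the second part only demonstrates that a pattern containing bytes that are not valid UTF-8 (here, cp1251 bytes standing in for "а-я") fails to compile under /u whether or not the build supports Unicode properties.

```php
<?php
// Hypothetical helper: true when this build's PCRE accepts /u and \p{...}.
function pcre_supports_unicode_properties(): bool
{
    // @ silences the warning an unsupported build would emit;
    // preg_match() returns 1 on a match and false on failure.
    return @preg_match('/^\p{L}$/u', 'я') === 1;
}

// The bytes E0 and FF are "а" and "я" in cp1251, which is not valid UTF-8,
// so compiling this pattern with /u fails on any build.
$pattern = "/[^\w\s\xE0-\xFF]/u";
$result = @preg_replace($pattern, '', 'test');

var_dump(pcre_supports_unicode_properties());
var_dump($result); // null means preg_replace() failed
```

If the helper returns true while the real pattern still fails, the cause is more likely the encoding of the pattern or the subject than the PCRE build.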

jeroen 2010-03-18 22:02:20

It turns out that \p{LN} is wrong - should be [\p{L}\p{N}]. My bad.

stereofrog 2010-03-18 22:15:44

I cannot get it to work, but I think I'll try it on another server; perhaps it's a local problem.

jeroen 2010-03-19 13:45:41
