I am trying to clean up a string coming from a search box on a multi-language site.

Normally I would use a regex like:

$allowed = "-+?!,.;:\w\s";
$txt_search = preg_replace("/[^" . $allowed . "]?(.*?)[^" . $allowed . "]?/iu", "$1", $_GET['txt_search']);

and that works fine for English texts.

However, now I need to do the same when the texts entered can be in any language (Russian now, Chinese in the future).

How can I clean up the string while preserving "normal texts" in the original language?

I thought about switching to a blacklist (although I'd rather not...), but at the moment the regex just completely destroys all the original input.

+1  A: 

It is a common problem that Russian letters are not recognised by the \w pattern, so you can use

$allowed = "-+?!,.;:\w\sа-я";

dchekmarev
Won't [а-я] be expanded to their UTF-8 numbers, so [50-60000] (numbers not accurate)? That would be slow for regex, I think?
Coronatus
Can you use a character class inside a character class? This would translate to [^-+?!,.;:\w\s[а-я]]
jeroen
Jeroen, I can, but you're right: there is no "subclass", so it just adds the "[" and "]" symbols. Answer fixed :)
dchekmarev
Hmmmm, that does not work. If I use that, I get a warning: "Message: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 5".
jeroen
I think you saved the file not in UTF-8 but in something like the cp1251 encoding.
dchekmarev
Thanks for your help. It's just a form that is getting sent in; how can I change its encoding?
jeroen
You can change the file encoding with iconv from a Unix shell (I assume it is standard Windows-1251): iconv -f cp1251 -t utf-8 < file.php > new_file.php && mv new_file.php file.php
dchekmarev
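If the submitted form value (rather than the source file) turns out to be the part that is not UTF-8, something along these lines might help. This is only a sketch: the function name to_utf8 is made up for illustration, and it assumes that any input which is not valid UTF-8 was sent as Windows-1251, which may not be true for your form.

```php
<?php
// Hypothetical helper: returns $input as UTF-8. Assumes (and this is only an
// assumption) that anything which is not valid UTF-8 is Windows-1251.
function to_utf8(string $input, string $fallback = 'CP1251'): string
{
    // An empty pattern with the /u modifier matches only valid UTF-8 subjects.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    // iconv() returns false when the conversion fails.
    $converted = iconv($fallback, 'UTF-8', $input);

    return $converted === false ? '' : $converted;
}
```

You would call it on the raw value before running the regex, e.g. to_utf8($_GET['txt_search']). Input that is already UTF-8 is returned unchanged.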
I cannot get it to work, but I think I'll try it on another server; perhaps it's a local problem.
jeroen
+3  A: 

You can use [\p{L}\p{N}] instead of \w (inside an existing character class, write \p{L}\p{N} without the extra brackets); see http://www.php.net/manual/en/regexp.reference.unicode.php
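As a rough sketch of how that could look for the search box in the question: the function name clean_search is hypothetical, and the underscore is listed explicitly only because \w matched it and \p{L}\p{N} does not.

```php
<?php
// Hypothetical helper: removes every character that is not a letter, a digit,
// an underscore, whitespace, or one of the punctuation marks from the question.
function clean_search(string $input): string
{
    // \p{L} = any letter, \p{N} = any number.
    // /u makes PCRE treat both the pattern and the subject as UTF-8.
    $cleaned = preg_replace('/[^-+?!,.;:_\p{L}\p{N}\s]+/u', '', $input);

    // preg_replace() returns null on error, e.g. when $input is not valid UTF-8.
    return $cleaned === null ? '' : $cleaned;
}
```

For example, clean_search('Привет, мир! <>{}') should give 'Привет, мир! ' and clean_search('日本語 #test$') should give '日本語 test'. It needs a PCRE build with Unicode property support and UTF-8 input.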

stereofrog
Thanks. I keep getting *invalid UTF-8 string at offset 5*. Is it possible that PCRE was not compiled with Unicode properties, and can I check that? At the moment I'm just testing on an Ubuntu 9.10 desktop server.
jeroen
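One way to check this from PHP itself might be a small probe like the one below. It is only a sketch: the function name is made up, and it relies on preg_match() returning false (with a warning) when the pattern cannot be compiled.

```php
<?php
// Hypothetical helper: reports whether this PHP build's PCRE accepts the /u
// modifier together with Unicode properties such as \p{L}.
function pcre_supports_unicode_properties(): bool
{
    // "\xD1\x8F" is the UTF-8 byte sequence for the Cyrillic letter "я",
    // written as escapes so the check does not depend on this file's encoding.
    // @ hides the compile warning emitted by a build without support.
    return @preg_match('/^\p{L}$/u', "\xD1\x8F") === 1;
}

var_dump(pcre_supports_unicode_properties());
```

If this prints bool(true), the PCRE build is probably fine and the "invalid UTF-8 string" message more likely comes from the encoding of the pattern or the source file.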
It turns out that \p{LN} is wrong - should be [\p{L}\p{N}]. My bad.
stereofrog
I cannot get it to work, but I think I'll try it on another server; perhaps it's a local problem.
jeroen