I am trying to replace every non-word character in a string, except spaces, and then collapse runs of multiple spaces into a single space.

Following code does this.

$cleanedString = preg_replace('/[^\w]/', ' ', $name);  
$cleanedString = preg_replace('/\s+/', ' ', $cleanedString);

But when I am trying to use mb_ereg_replace nothing happens.

$cleanedString = mb_ereg_replace('/[^\w]/', ' ', $name);  
$cleanedString = mb_ereg_replace('/\s+/', ' ', $cleanedString);

$cleanedString is the same as $name in the case above. What am I doing wrong?

A: 

The input is not Multi-Byte hence the mb function fails.

shamittomar
OK. But can you please explain when we should use mb_ereg_replace instead of preg_replace if my input is in UTF-8? Currently I pass English text as $name, but if tomorrow I use some other language, say Hindi, will my code break?
Jithin
Wrong. The multibyte extension can handle single byte encodings.
Artefacto
@Artefacto: OK, My bad.
shamittomar
+4  A: 

mb_ereg_replace doesn't use delimiters, so the slashes are treated as literal characters in the pattern. You may or may not also have to set the regex encoding beforehand.

mb_regex_encoding("UTF-8");
//regex could also be \W
$cleanedString = mb_ereg_replace('[^\w]', ' ', $name);
$cleanedString = mb_ereg_replace('\s+', ' ', $cleanedString);
Artefacto
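To illustrate the point about delimiters, here is a minimal sketch, assuming the mbstring extension is loaded. The sample value of `$name` is made up for the example. With the PCRE-style slashes left in, the pattern only matches a non-word character sitting between two literal slashes, which is why the string in the question came back unchanged.

```php
<?php
// Assumes the mbstring extension is available; $name is a hypothetical sample input.
mb_regex_encoding('UTF-8');
$name = 'foo/bar,  baz';

// The slashes are literal here: this looks for "/", one non-word character, "/".
// No such sequence exists in $name, so nothing is replaced.
$withSlashes = mb_ereg_replace('/[^\w]/', ' ', $name);

// Without delimiters the pattern behaves as intended.
$cleaned = mb_ereg_replace('[^\w]', ' ', $name);   // non-word characters become spaces
$cleaned = mb_ereg_replace('\s+', ' ', $cleaned);  // runs of whitespace collapse to one space

var_dump($withSlashes === $name);
var_dump($cleaned);
```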
Thanks, that was the mistake I made. If my input is UTF-8, is there any recommendation regarding which method to use?
Jithin
@Jithin If it's UTF-8, you might as well use `preg_replace` with the `u` flag: `preg_replace('/\s+/u', ' ', $cleanedString);`
Artefacto
@Artefacto Thanks. Can you please tell me if it is safe to assume that as long as input is in UTF-8 encoding, the preg_replace will work for most languages?
Jithin
@Jithin Depends on what you mean by "works". It will work in a strict sense, in that it won't generate corrupted data, but it probably doesn't do what you want. Consider the first regex. In PCRE (the engine `preg_replace` uses), `\w` will only mean `[a-zA-Z0-9_]`. If you want to eliminate all non-word characters, a better option is to use `[^\p{L}\p{Nd}\p{Mn}_]`. This will match all characters that are not (per Unicode) letters, non-spacing marks (for accents, etc.), decimal digits or the underscore.
Artefacto
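The suggestion above can be sketched as follows. This is only an illustration: the sample string is made up, and it assumes PHP 7 or later (for the `\u{...}` string escapes) with PCRE built with Unicode property support, which is the usual case.

```php
<?php
// Hypothetical sample input: "naïve café, 123" written with escapes so the
// example does not depend on the encoding of the source file.
$name = "na\u{00EF}ve caf\u{00E9}, 123";

// Replace everything that is not a letter, decimal digit, non-spacing mark
// or underscore with a space. The u flag makes PCRE treat the subject as UTF-8.
$cleaned = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

// Collapse runs of whitespace into a single space.
$cleaned = preg_replace('/\s+/u', ' ', $cleaned);

var_dump($cleaned);
```

Spelling out the Unicode properties keeps the behaviour explicit instead of relying on what `\w` happens to mean in a given engine or build.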
@Artefacto Thanks. Won't \w match characters depending on the current locale? So if I change the locale (using setlocale()) based on the language of the input string, won't it work? Or is the approach wrong? When running in the context of a web server like Apache, I do not know whether the locale is determined by the base system setting or by the incoming request.
Jithin
@Jithin No, it's locale independent.
Artefacto
@Artefacto . Thanks again. But I read that for PCRE \w depends on locale. http://perldoc.perl.org/perlrecharclass.html#Backslashed-sequences
Jithin
SORRY My Bad. I was looking at wrong documentation. It was for perl. :D
Jithin
@Jithin See the man page for PCRE [here](http://www.pcre.org/pcre.txt), "General comments about UTF-8 mode", point 6.
Artefacto
@Artefacto Many, many thanks. One more doubt: is mb_ereg_replace using PCRE? Does a regex pattern behave the same with mb_ereg_replace?
Jithin
@Jithin No. It uses [oniguruma](http://www.geocities.jp/kosako3/oniguruma/). By default, `\w` means (Letter|Mark|Number|Connector_Punctuation).
Artefacto
@Artefacto Hmm, so much to learn. :| From the documentation http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt, it looks like \w can give a true word boundary for Unicode input. Is there any recommendation on choosing between preg_replace and mb_ereg_replace? I guess preg_replace, being natively available in PHP, could be faster.
Jithin
@Jithin They're both "natively available" (well, actually PCRE cannot be left out in PHP 5.3, but that shouldn't influence the speed). However, you'll find that PHP's PCRE interface functions (the preg_ family) are both easier to use and better documented. If I had to guess, I'd say PCRE is also faster.
Artefacto
@Artefacto I faced a different issue with respect to characters containing horizontal spaces. [^\p{L}\p{Nd}\p{Mn}_] had to be modified to [^\p{L}\p{Nd}\p{Mn}\p{Mc}_] so that spacing marks are kept as well. I asked this question here: http://stackoverflow.com/questions/3598212/php-vim--bangalore-has-a-break-before-the-last-character-
Jithin
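A small sketch of the spacing-mark issue described in the last comment, using a Devanagari word chosen purely as an example (it is not the word from the linked question). Two of its vowel signs, U+093F and U+0940, are spacing combining marks (`Mc`), so the narrower class replaces them with spaces and breaks the word apart.

```php
<?php
// Hypothetical sample: the word for "Hindi" in Devanagari, stored as UTF-8.
$word = 'हिन्दी';

// Without \p{Mc}, the spacing vowel signs count as "non-word" and become spaces.
$withoutMc = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $word);

// With \p{Mc} added, spacing marks are kept and the word survives intact.
$withMc = preg_replace('/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u', ' ', $word);

var_dump($withoutMc === $word); // the word was altered
var_dump($withMc === $word);    // the word is unchanged
```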