I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore, which gave me बंगलौर.

But when I pasted it into vim, there was a break before the last character, र.
I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u to strip everything that is not part of a word. But this treats the last character as a separate word.

My input string is मैनेजमेंट, बंगलौर and I expect the output to be मैनेजमेंट बंगलौर after the preg_replace:

$cleanedString = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

But the output I am getting is मैनेजमेंट बंगल र . What am I doing wrong here? I guess the problem starts from how vim handled the text I pasted.

+1  A: 

Try this regex "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u"

The vowel sign ौ in लौ takes up extra horizontal space, unlike the vowel sign ै in मै. The Unicode class \p{Mn} matches only non-spacing marks, so ौ is not covered by it and gets replaced. Use \p{Mc} to match spacing marks. You can also use \p{M} to match all combining marks: "/[^\p{L}\p{Nd}\p{M}_]/u"
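To see the difference on the input from the question, here is a minimal sketch. It assumes PHP's PCRE was built with Unicode property support and that the file is saved as UTF-8; the variable names $mnOnly and $allMarks are only for illustration.

```php
<?php
// Input string from the question.
$name = 'मैनेजमेंट, बंगलौर';

// Original pattern: only non-spacing marks (\p{Mn}) are kept,
// so the spacing vowel sign ौ (U+094C) is replaced with a space.
$mnOnly = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

// Pattern with \p{M}: every combining mark is kept, spacing or not.
$allMarks = preg_replace('/[^\p{L}\p{Nd}\p{M}_]/u', ' ', $name);

// Check whether the vowel sign ौ survived each replacement.
var_dump(strpos($mnOnly, "\u{094C}") !== false);   // bool(false)
var_dump(strpos($allMarks, "\u{094C}") !== false); // bool(true)
```

The "\u{...}" escapes need PHP 7.0 or later; on older versions, paste the character itself instead.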

From regular-expressions.info/unicode

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

  • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
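If you want to check which of these categories a given character falls into, something like the following sketch may help. markCategory is a hypothetical helper written for this example, not a built-in function, and mb_ord assumes PHP 7.2 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: return the mark subcategory of a single
// character, or '-' if it is not a combining mark.
function markCategory(string $ch): string {
    if (preg_match('/^\p{Mn}$/u', $ch)) return 'Mn';
    if (preg_match('/^\p{Mc}$/u', $ch)) return 'Mc';
    if (preg_match('/^\p{Me}$/u', $ch)) return 'Me';
    return '-';
}

// Split the word into code points and print each one's category.
$chars = preg_split('//u', 'बंगलौर', -1, PREG_SPLIT_NO_EMPTY);
foreach ($chars as $ch) {
    printf("U+%04X %s\n", mb_ord($ch, 'UTF-8'), markCategory($ch));
}
```

For बंगलौर this should report the anusvara ं (U+0902) as Mn and the vowel sign ौ (U+094C) as Mc, which is why the \p{Mn}-only pattern splits the word right before र.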
Amarghosh
@Amarghosh Thanks. I was trying to match all words in a UTF-8 input string, whichever language they may be in. The regex "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u" helped.
Jithin
I also noticed something in vim. vim seems to have some trouble detecting the correct word end for बंगलौर.
Jithin
@Jithin Can you elaborate on that, maybe as a new question if needed?
Amarghosh
@Amarghosh In vim, if I press w it should ideally move to the beginning of the next word. But for बंगलौर, pressing w at the start moves to र.
Jithin
@Jithin What about other Hindi words? Is it only for Bangalore, or words ending with `r`, or for all Indic words?
Amarghosh