Searching a file written in Hindi (Devanagari script, UTF-16) gave rise to the following problem.

The file contains:

त्रास ततत जुग नींद ना हा बु

Note that the first character, 'त्र', is a sequence of multiple code points: त + ् + र. When I search for 'त', I get 4 matches, including the त in that first character. I am using Java.

How can I search for only those occurrences of 'त' that are not part of a multi-code-point character?

Any help will be appreciated. :)

+1  A: 

You can do this using Unicode properties, I believe.

त(?!\p{M}+)

This should match the त code point as long as it is not followed by any code point in the M (Mark) category, i.e. characters intended to be combined with other characters. It uses a negative lookahead to make that assertion.

Edit: if that doesn't work right away, try

\uxxxx(?!\p{M}+)

where xxxx is the four-digit hexadecimal code point of the त character.
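As a rough illustration of the lookahead described above, here is a minimal Java sketch. The class and member names are made up for this example, and the sample string is only the first two words of the question's text, written with `\u` escapes so it does not depend on the source file encoding. It assumes त is U+0924 and the virama ् is U+094D. The `+` after `\p{M}` is dropped because a lookahead only needs to see one mark.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the thread.
class StandaloneLetterSearch {
    // "त्रास ततत": TA + VIRAMA + RA + AA + SA, a space, then three plain TA.
    static final String SAMPLE = "\u0924\u094D\u0930\u093E\u0938 \u0924\u0924\u0924";

    // Every TA (U+0924), whether or not a combining mark follows it.
    static final Pattern PLAIN_TA = Pattern.compile("\\u0924");

    // TA only when the next code point is not in a Mark category (Mn, Mc, Me).
    static final Pattern STANDALONE_TA = Pattern.compile("\\u0924(?!\\p{M})");

    // Counts how many times the pattern is found in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(PLAIN_TA, SAMPLE));      // prints 4
        System.out.println(countMatches(STANDALONE_TA, SAMPLE)); // prints 3
    }
}
```

The first त is skipped because it is followed by the virama, which belongs to category Mn. A त followed by a vowel sign such as ा would be skipped as well, since vowel signs are also marks.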

Sean Nyman
Thank you, Sean :) The negative lookahead works well.
A: 

It seems that the glyph 'त्र' is actually a ligature or conjunct, not a multiple code point character. So I guess you get the expected result (unless you want to match glyphs). See http://en.wikipedia.org/wiki/Devanagari#Conjuncts.

fbonnet
I am a little confused here. Aren't glyphs represented by multiple code points? But yes, I want the program to match glyphs. I am using the java.util.regex package. There are a few issues with conjuncts: for example, ध्वं and ल्ल्य throw a PatternSyntaxException when used as input to build the regex with the Pattern.compile() method.
Here, the glyph for each base character uses a single code point (as do most characters in the BMP), whereas the glyph for the ligature uses several (three). But since you want to match glyphs anyway, Sean's solution suits your needs. I guess that Java has problems with multiple-code-point sequences.
fbonnet
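
The thread does not say what caused the PatternSyntaxException mentioned above. If the search text comes from user input, one thing that may be worth trying is passing it through `Pattern.quote()` so that it is always treated as a literal, and then appending the lookahead from Sean's answer. The sketch below is only an assumption about how that could look; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not taken from the thread.
class LiteralConjunctSearch {
    // Builds a pattern that matches the input literally, but only where it is
    // not immediately followed by a combining mark (category M).
    static Pattern literalNotFollowedByMark(String userInput) {
        return Pattern.compile(Pattern.quote(userInput) + "(?!\\p{M})");
    }

    // Returns true if the needle occurs in the text as a complete unit.
    static boolean found(String needle, String text) {
        return literalNotFollowedByMark(needle).matcher(text).find();
    }

    public static void main(String[] args) {
        String conjunct = "\u0927\u094D\u0935\u0902"; // ध्वं: DHA + VIRAMA + VA + ANUSVARA
        System.out.println(found(conjunct, "x " + conjunct + " y"));
        System.out.println(found("\u0924", "\u0924\u094D\u0930")); // त inside त्र
    }
}
```

Because the input is quoted, characters that are special in regex syntax (parentheses, brackets and so on) no longer affect compilation. Whether this removes the exception in the original program depends on its actual cause.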