ansaurus

Question

Answer 1

+1 A:

You can do this using Unicode properties, I believe.

त(?!\p{M}+)

This should match the त code point as long as it is not followed by any code point in the M (Mark) category, i.e. characters intended to be combined with other characters, such as the virama and the vowel signs. It uses a negative lookahead to make that assertion.

Edit: if that doesn't work right away, try

\uxxxx(?!\p{M}+)

where xxxx is the four-digit hexadecimal value of the त symbol's code point (U+0924, so the escape is \u0924).
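The lookahead above can be sketched in Java, assuming java.util.regex is the engine in use. The class and method names (BareTaMatcher, countBareTa) are hypothetical and only serve the illustration. The sketch drops the + inside the lookahead, since a single mark is already enough to reject a match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
class BareTaMatcher {
    // U+0924 (TA) not followed by a code point in general category M (Mark).
    // No quantifier is needed inside the negative lookahead.
    static final Pattern BARE_TA = Pattern.compile("\u0924(?!\\p{M})");

    // Counts occurrences of TA that carry no combining mark.
    static int countBareTa(String text) {
        Matcher m = BARE_TA.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String bare = "\u0924";                  // TA alone
        String conjunct = "\u0924\u094D\u0930";  // TA + VIRAMA + RA
        String withVowel = "\u0924\u093E";       // TA + vowel sign AA

        System.out.println(countBareTa(bare));
        System.out.println(countBareTa(conjunct));
        System.out.println(countBareTa(withVowel));
    }
}
```

The virama (U+094D) and the vowel sign AA (U+093E) both belong to the Mark category, so only the first string yields a match.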

Sean Nyman 2009-08-25 13:28:20

Thank you, Sean :) The negative lookahead works well.

2009-08-27 05:22:19

Answer 2

A:

It seems that the glyph 'त्र' is actually a ligature (conjunct) rendered from a sequence of separate characters, not a single character. So I guess you are getting the expected result (unless you want to match glyphs). See http://en.wikipedia.org/wiki/Devanagari#Conjuncts.

fbonnet 2009-08-25 13:29:31

I am a little confused here. Aren't glyphs represented by multiple code points? But yes, I want the program to match glyphs. I am using the java.util.regex package. There are a few issues with conjuncts: for example, ध्वं and ल्ल्य throw a PatternSyntaxException when they are taken as input to build the regex with Pattern.compile().
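The thread does not say why Pattern.compile() fails on these inputs. One common precaution when a pattern is built from arbitrary input is to pass the input through Pattern.quote(), so it is treated as a literal, and then append the lookahead. A minimal sketch under that assumption, with hypothetical names (QuotedGlyphSearch, glyphPattern, containsGlyph):

```java
import java.util.regex.Pattern;

// Hypothetical helper class; an illustration, not the asker's code.
class QuotedGlyphSearch {
    // Builds a pattern that matches the input literally, as long as the
    // match is not followed by a further combining mark.
    static Pattern glyphPattern(String glyph) {
        return Pattern.compile(Pattern.quote(glyph) + "(?!\\p{M})");
    }

    static boolean containsGlyph(String text, String glyph) {
        return glyphPattern(glyph).matcher(text).find();
    }

    public static void main(String[] args) {
        // LA + VIRAMA + LA + VIRAMA + YA, one of the conjuncts from the comment
        String conjunct = "\u0932\u094D\u0932\u094D\u092F";
        System.out.println(containsGlyph("x " + conjunct + " y", conjunct));

        // TA inside TA + VIRAMA + RA is rejected by the lookahead
        System.out.println(containsGlyph("\u0924\u094D\u0930", "\u0924"));
    }
}
```

Quoting only protects against characters that carry regex meaning; if the exception has another cause, this sketch will not address it.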

2009-08-27 05:37:32

Here, each base character's glyph corresponds to a single code point (as do most characters in the BMP), whereas the glyph for the ligature is made up of several (three: त, the virama U+094D, and र). But since you want to match glyphs anyway, Sean's solution suits your needs. I guess that Java has problems with multiple code point sequences.
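The code point count can be checked directly in Java. The following sketch uses only String.codePointCount and Character.getType from the standard library; the class and method names (ConjunctCodePoints, codePoints, isMark) are hypothetical.

```java
// Hypothetical helper class for inspecting a conjunct's code points.
class ConjunctCodePoints {
    // Number of Unicode code points in the string (not UTF-16 units).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the code point is in general category M (Mn, Mc or Me),
    // which is what \p{M} matches in a regex.
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String conjunct = "\u0924\u094D\u0930"; // TA + VIRAMA + RA
        System.out.println(codePoints(conjunct));
        // Print each code point and whether it is a combining mark.
        conjunct.codePoints().forEach(cp ->
            System.out.printf("U+%04X mark=%b%n", cp, isMark(cp)));
    }
}
```

Only the middle code point, the virama, is a mark, which is why the lookahead after त rejects the conjunct.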

fbonnet 2009-08-28 20:27:05


Searching unicode text using regex
