I am writing regular expressions for Unicode text in Java. However, for the script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C) because they are treated like non-word characters, the way a space would be.

Example: suppose I have the string "कमल कमाल कम्हल कम्हाल". Note that 'मा' in the second word is formed by combining म and ा (which is treated like a space character). The same happens in the last word. As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct for the language.

I hope the example helps.

Can I write a regular expression that behaves like \b except that it does not treat certain characters as boundaries? Any feedback would be appreciated.

+1  A: 

You should be able to accomplish what you want with the following regex operators:

(?=X)   X, via zero-width positive lookahead
(?!X)   X, via zero-width negative lookahead
(?<=X)  X, via zero-width positive lookbehind
(?<!X)  X, via zero-width negative lookbehind

(The above is quoted from the Java 6 Pattern API docs.)

Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters".
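
As a rough illustration of the substitution described above, here is a minimal Java sketch. It assumes that every code point in the Devanagari block (U+0900 to U+097F) should count as a "word character", which is a simplification: the block also contains punctuation such as the danda, so a real character class may need to be narrower. The class name DevanagariWords and the method findWords are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DevanagariWords {
    // Assumption: the whole Devanagari block (U+0900 to U+097F) is the set of word characters.
    static final String WORD_CHAR = "[\\u0900-\\u097F]";

    // Replacement for \b before a word: no word character behind, a word character ahead.
    static final String WORD_START = "(?<!" + WORD_CHAR + ")(?=" + WORD_CHAR + ")";

    // Replacement for \b after a word: a word character behind, no word character ahead.
    static final String WORD_END = "(?<=" + WORD_CHAR + ")(?!" + WORD_CHAR + ")";

    // One or more word characters enclosed by the two custom boundaries.
    static final Pattern WORD = Pattern.compile(WORD_START + WORD_CHAR + "+" + WORD_END);

    // Returns every run of word characters that is delimited by the custom boundaries.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes
        // so the source file does not depend on the compiler's encoding.
        String text = "\u0915\u092E\u0932 \u0915\u092E\u093E\u0932 "
                + "\u0915\u092E\u094D\u0939\u0932 \u0915\u092E\u094D\u0939\u093E\u0932";
        System.out.println(findWords(text));
    }
}
```

Because the vowel sign ा (U+093E) and the virama ् (U+094D) fall inside the assumed class, they no longer split a word in the middle.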

Laurence Gonsalves
I thought about doing that, but after reading http://www.regular-expressions.info/wordboundaries.html I was not sure whether it would work.
rohit.arondekar
+1  A: 

The equivalent for word boundaries (if the built-in boundaries are not what you were expecting) would be:

 (?<![x-y])(?=[x-y])...(?<=[x-y])(?![x-y])

That is because a "word boundary" means "a location where there is a word character on one side and not on the other".

So with look-behind and look-ahead expressions, you can define your own class of characters [x-y] to check when you want to isolate a "word boundary".
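
To make the idea concrete, here is a small Java sketch that wraps a literal word in the custom boundaries, with the character class passed in as a parameter in the role of [x-y]. The names WholeWordSearch, wholeWord and count are hypothetical, and treating the entire Devanagari block as the word-character class is an assumption made only for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordSearch {
    // Assumed word-character class for the example: the Devanagari block, U+0900 to U+097F.
    static final String DEVANAGARI = "[\\u0900-\\u097F]";

    // Builds a pattern that matches `word` only when it is not adjacent to
    // another character from `wordCharClass` (the role played by [x-y] above).
    static Pattern wholeWord(String word, String wordCharClass) {
        String start = "(?<!" + wordCharClass + ")(?=" + wordCharClass + ")";
        String end = "(?<=" + wordCharClass + ")(?!" + wordCharClass + ")";
        // Pattern.quote keeps the searched word literal.
        return Pattern.compile(start + Pattern.quote(word) + end);
    }

    // Counts whole-word occurrences of `word` in `text` using the Devanagari class.
    static int count(String word, String text) {
        Matcher m = wholeWord(word, DEVANAGARI).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

With this sketch, searching the text "कमल कमाल" for "कमल" finds the first word only, while searching for "कम" finds nothing, since in both words it is followed by another character from the class.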

VonC
Okay, I think I understand now. Both your answer and Laurence's are right, so which one do I mark as correct? :D
rohit.arondekar