In JavaScript:

"ab abc cab ab ab".replace(/\bab\b/g, "AB");

correctly gives me:

"AB abc cab AB AB"

When I use non-ASCII characters (Greek letters here) though:

"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");

the word boundary operator doesn't seem to work:

"αβ αβγ γαβ αβ αβ"

Is there a solution to this?

+4  A: 

The word boundary assertion \b matches only at a position that has a word character on one side and a non-word character (or the start or end of the string) on the other, so .\b. is equivalent to \W\w or \w\W. And \w is defined as [A-Za-z0-9_], so \w doesn’t match Greek characters. Thus you cannot use \b for this case.
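A quick way to check this, as a minimal sketch (the variable names are only for illustration): since a Greek letter is not a word character for the engine, there is never a position in the Greek string where \b can match.

```javascript
// \w is [A-Za-z0-9_], so Greek letters count as non-word characters.
const asciiIsWord = /\w/.test("a");
const greekIsWord = /\w/.test("α");

// \b needs a word character on exactly one side of the position.
// In "αβ αβγ" every character is a non-word character, so \b never matches.
const asciiBoundary = /\bab\b/.test("ab abc");
const greekBoundary = /\bαβ\b/.test("αβ αβγ");

console.log(asciiIsWord, greekIsWord, asciiBoundary, greekBoundary);
// true false true false
```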

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")
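The expression above treats only whitespace and the string edges as boundaries, so a word followed by punctuation is skipped. The following is a rough sketch of an alternative, not part of the original answer: it assumes an engine with ES2018 support for the u flag, Unicode property escapes (\p{L}, \p{N}) and lookbehind, which older browsers lack. The helper name replaceWholeWord and the choice of what counts as a "word character" are hypothetical.

```javascript
// The whitespace-based expression from the answer, for comparison.
const whitespaceBased = "αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");
const whitespaceWithPunctuation = "αβ, αβ.".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

// Hypothetical helper: replace `word` only when it is not adjacent to a
// Unicode letter, digit, or underscore. Requires ES2018 regex features.
function replaceWholeWord(text, word, replacement) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Assumption: "word character" means any Unicode letter, number, or "_".
  const wordChar = "[\\p{L}\\p{N}_]";
  const pattern = new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "gu");
  // A function replacement avoids "$" sequences in `replacement` being interpreted.
  return text.replace(pattern, () => replacement);
}

const unicodeAware = replaceWholeWord("αβ αβγ γαβ αβ αβ", "αβ", "AB");
const unicodeAwareWithPunctuation = replaceWholeWord("αβ, αβ.", "αβ", "AB");

console.log(whitespaceBased);              // "AB αβγ γαβ AB AB"
console.log(whitespaceWithPunctuation);    // "αβ, αβ." (unchanged)
console.log(unicodeAware);                 // "AB αβγ γαβ AB AB"
console.log(unicodeAwareWithPunctuation);  // "AB, AB."
```

Neither lookaround consumes characters, so no capture group has to be put back in the replacement. Where those features are not available, the whitespace-based expression remains the portable choice.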
Gumbo
thanks. The use of the lookahead (?=...) notation looks interesting as well. Could this be done without it?
cherouvim
@cherouvim: No, it would consume the space after the word that is then the start for the next lookup. So just looking at `"αβ αβ"`, the first match would consume `"αβ |αβ"` (`|` indicates the internal pointer) and the last part would not be matched because there is no leading space left. But since the look-ahead assertion does not consume characters, the position of the pointer after the first match will be `"αβ| αβ"` and the leading space is preserved for the next match.
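The difference described in this comment can be seen side by side. This is only an illustrative sketch; the consuming variant with a second capture group is a hypothetical rewrite of the expression, not something from the answer.

```javascript
const input = "αβ αβ";

// Consuming variant: the trailing space becomes part of the first match,
// so the second "αβ" no longer has a leading space (or ^) to match against.
const consuming = input.replace(/(^|\s)αβ(\s|$)/g, "$1AB$2");

// Look-ahead variant: the trailing space is only checked, not consumed,
// so it is still available as the leading \s of the next match.
const lookahead = input.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

console.log(consuming); // "AB αβ"
console.log(lookahead); // "AB AB"
```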
Gumbo
that is great. Thanks Gumbo!
cherouvim
+1  A: 

Not all regular expression implementations in JavaScript engines are Unicode-aware.

For example, Microsoft's JScript, as used in IE, is limited to ANSI.

AnthonyWJones
A: 

Not all JavaScript regexp implementations support Unicode, so you may need to escape the characters:

"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"

For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html

Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least enable you to match the characters properly.
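As a small check, assuming the source file is read in the correct encoding: \uXXXX escapes denote the same characters in string literals and in regex literals, so the escaped and unescaped forms should behave identically. The escapes mainly help when the file's encoding cannot be relied on. The variable names below are only for illustration.

```javascript
const literal = "αβ αβγ";
const escaped = "\u03b1\u03b2 \u03b1\u03b2\u03b3";

// Both spellings produce the same string.
const sameString = literal === escaped;

// Both spellings produce an equivalent pattern.
const viaEscapes = literal.replace(/\u03b1\u03b2/g, "AB");
const viaLiterals = literal.replace(/αβ/g, "AB");

console.log(sameString);  // true
console.log(viaEscapes);  // "AB ABγ"
console.log(viaLiterals); // "AB ABγ"
```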

Sean Kinsey
Then why don’t you use the same Unicode escapes for the string as well?
Gumbo
Because one is parsed as a string, and one as a literal RegExp - I'm not sure if it matters though..
Sean Kinsey
But if the regular expression implementation does not support Unicode, how is a Unicode escape sequence like `\u03b1` supposed to be interpreted?
Gumbo