views:

2160

answers:

3

Does anyone know of any JavaScript libraries that support Unicode-aware regular expressions? For example, there should be something akin to \w that can match any code point in the Letter or Mark categories (not just the ASCII ones), and ideally filters like [[P*]] for punctuation, etc.

+7  A: 

Even though JavaScript operates on Unicode strings, it does not consistently implement Unicode-aware character classes, and has (to my knowledge) no concept of POSIX character classes or Unicode sub-ranges.

Check your expectations here: Javascript RegExp Unicode Character Class tester

Flagrant Badassery has an article on JavaScript, Regex, and Unicode that sheds some light on the matter.

Be sure to also read Regex and Unicode here on SO. You will probably have to build your own "punctuation character class".

Check out the Regular Expression: Match Unicode Block Range builder, which lets you build a JavaScript regular expression that matches characters that fall in any number of specified Unicode blocks.

I just did it for the "General Punctuation" and "Supplemental Punctuation" sub-ranges, and the result is as simple and straightforward as I would have expected:

[\u2000-\u206F\u2E00-\u2E7F]
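To show how that class might be used, here is a minimal sketch. The helper name stripBlockPunctuation is made up for illustration; only the character class itself comes from the builder output above. Note that it covers just those two blocks, so ASCII punctuation such as "!" is left alone.

```javascript
// Character class from above: "General Punctuation" (U+2000-U+206F)
// and "Supplemental Punctuation" (U+2E00-U+2E7F).
var unicodePunctuation = /[\u2000-\u206F\u2E00-\u2E7F]/g;

// Hypothetical helper: remove every character that falls in those two blocks.
function stripBlockPunctuation(text) {
  return text.replace(unicodePunctuation, "");
}

// Curly quotes, a dash (U+2014) and an ellipsis (U+2026) are all in range.
var sample = "\u201CHello\u201D \u2014 world\u2026";
console.log(stripBlockPunctuation(sample)); // "Hello  world" (two spaces remain)
```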
Tomalak
Great tools! Thanks!
JannieT
+2  A: 

In JavaScript, \w and \d are ASCII, while \s is Unicode. Don't ask me why. JavaScript does support \p with Unicode categories, which you can use to emulate a Unicode-aware \w and \d.

For \d use \p{N} (numbers)

For \w use [\p{L}\p{N}\p{Pc}\p{M}] (letters, numbers, underscores, marks)

Update: Unfortunately, I was wrong about this. JavaScript does not officially support \p either, though some implementations may still support it. The only Unicode support in JavaScript regexes is matching specific code points with \uFFFF-style escapes. You can use those in ranges in character classes.
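As a rough sketch of the range approach, here is a hand-built "word" class. It assumes you only care about Basic Latin and Latin-1 Supplement letters; the names latinLetter, latinWord and findWords are invented for this example, and a real Unicode-aware \w would need many more ranges.

```javascript
// Assumption: only Basic Latin and Latin-1 Supplement letters matter here.
// U+00D7 (multiplication sign) and U+00F7 (division sign) sit inside that
// block but are not letters, so the ranges deliberately skip them.
var latinLetter = "A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF";

// Rough stand-in for \w+ : letters from the ranges above, digits, underscore.
var latinWord = new RegExp("[" + latinLetter + "0-9_]+", "g");

// Hypothetical helper: return all "words" found in the text.
function findWords(text) {
  return text.match(latinWord) || [];
}

var text = "na\u00EFve caf\u00E9 \u00D7 2";
console.log(findWords(text));     // the accented words stay in one piece
console.log(text.match(/\w+/g));  // plain \w splits at the non-ASCII letters
```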

Jan Goyvaerts
A: 

I'm not sure which browser has JavaScript with support for \p with Unicode categories, but Firefox definitely doesn't, unfortunately.
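If you want to find out at runtime rather than guess, a feature test along these lines should work. This is only a sketch: it relies on the "u" flag, which ECMAScript 2018 specifies as the way to enable \p{...}; engines that predate it should throw a SyntaxError when compiling the pattern or the flag, which the try/catch turns into false. The function name is made up.

```javascript
// Hypothetical feature test: does this engine accept \p{...} with the "u" flag?
function supportsUnicodePropertyEscapes() {
  try {
    // The pattern is built from a string so that an engine without support
    // fails here at runtime instead of refusing to parse the whole script.
    return new RegExp("\\p{L}", "u").test("\u00E9");
  } catch (e) {
    return false;
  }
}

if (supportsUnicodePropertyEscapes()) {
  console.log("\\p{L} is available; a Unicode-aware class can be used directly");
} else {
  console.log("no \\p support; fall back to explicit \\uXXXX ranges");
}
```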