ansaurus

Question

Regular expression to match non-english characters?

Answer 1

+2 A:

All Unicode-enabled regex flavours should have a special character class like \w that matches any Unicode letter. Take a look at your specific flavour here.
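As a rough illustration of the difference between an ASCII-based \w and a real Unicode letter class, here is a minimal sketch. It assumes a JavaScript engine with ES2018 Unicode property escapes (\p{L} with the u flag), which did not exist when this thread was written. The sample characters are arbitrary choices.

```javascript
// In JavaScript, \w is ASCII-based: it means [A-Za-z0-9_].
const asciiWord = /^\w$/;
// \p{L} matches any Unicode letter; it requires the u flag (ES2018 or later).
const unicodeLetter = /^\p{L}$/u;

// a, é (U+00E9), Α (Greek capital Alpha, U+0391), 5, _
const samples = ["a", "\u00E9", "\u0391", "5", "_"];

const results = samples.map((ch) => ({
  ch,
  w: asciiWord.test(ch),
  letter: unicodeLetter.test(ch),
}));

console.log(results);
```

In this sketch \w accepts the digit and the underscore but rejects é and Α, while \p{L} does the opposite.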

OregonGhost 2008-09-29 18:42:31

This is correct for most flavors of regex, but not for JavaScript, at least according to http://www.regular-expressions.info/javascript.html

Paul Wicks 2008-09-29 18:52:00

Bad luck then, I guess. At least you can then use the Unicode charts posted by olle to find your characters ;)

OregonGhost 2008-09-29 18:55:15

I think \w depends on the cultural settings on the client.

troelskn 2008-09-29 19:19:30

I don't know, but in .NET, you can always specify the culture you want. Apart from that, what is a letter and what is not is defined in the Unicode standard and does not depend on culture.

OregonGhost 2008-09-29 20:56:20

Answer 2

+1 A:

You do it the same way as any other character matching, but you use \uXXXX, where XXXX is the hexadecimal Unicode code point of the character.

Look at: http://unicode.org/charts/charindex.html

http://unicode.org/charts/

http://www.decodeunicode.org/
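A small sketch of the \uXXXX approach, for illustration only. The code points used here (U+00E9 for é, and the Greek and Coptic block U+0370 to U+03FF) are example choices looked up in the charts, not something the answer specifies.

```javascript
// \u00E9 is é (LATIN SMALL LETTER E WITH ACUTE).
const eAcute = /\u00E9/;
// \uXXXX escapes also work as range endpoints inside a character class:
// here, the Greek and Coptic block, U+0370 to U+03FF.
const greekRun = /[\u0370-\u03FF]+/;

const hasEAcute = eAcute.test("caf\u00E9"); // "café"
const greekMatch = "alpha = \u0391\u03BB\u03C6\u03B1".match(greekRun); // "alpha = Αλφα"

console.log(hasEAcute, greekMatch && greekMatch[0]);
```

The drawback is that you have to list every block or range you care about by hand.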

olle 2008-09-29 18:43:57

Answer 3

+12 A:

This should do it:

[^\x00-\x80]+

It matches runs of characters whose codes are greater than 128 (0x80), all of which lie outside the ASCII range (ASCII itself ends at 0x7F). You can do the same thing with Unicode escapes:

[^\u0000-\u0080]+
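To see what the negated class actually picks up, here is a minimal sketch that applies the pattern above to a made-up sample string. The string and the variable names are illustrative assumptions.

```javascript
// The pattern from the answer, with the g flag so match() returns every run.
const nonAscii = /[^\u0000-\u0080]+/g;

// "naïve café costs €5", written with escapes so the source stays ASCII-only.
const text = "na\u00EFve caf\u00E9 costs \u20AC5";

const matches = text.match(nonAscii);
console.log(matches);
```

Note that the euro sign is matched too: the class selects anything beyond that range, not just letters.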

yjerem 2008-09-29 18:45:10

What about characters outside that range? There are more than 0x80 letters.

OregonGhost 2008-09-29 18:49:40

The circumflex negates the character class.

troelskn 2008-09-29 19:18:28

Then it's still wrong, in that it matches only "non-english" letters (and maybe other characters, but I don't know the full Unicode tables), whereas the question author needs to match all letters, even though the question only mentions the non-english characters. You know, because then it's complete.

OregonGhost 2008-09-29 21:24:43

This doesn't answer this question about matching words with non-english characters...
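One possible way to get whole words that contain at least one non-English character is sketched below. It assumes an engine with ES2018 Unicode property escapes, and the function name wordsWithNonEnglishLetters is hypothetical. It finds runs of letters first, then keeps only the runs that contain a non-ASCII character.

```javascript
// A "word" here is a run of Unicode letters (requires the u flag, ES2018+).
const word = /\p{L}+/gu;
// Any single character outside the ASCII range. No g flag, so test() is stateless.
const nonAsciiChar = /[^\x00-\x7F]/;

// Hypothetical helper: returns the words that contain a non-ASCII character.
function wordsWithNonEnglishLetters(text) {
  const words = text.match(word) || [];
  return words.filter((w) => nonAsciiChar.test(w));
}

// "The naïve café owner said Αλφα twice"
const found = wordsWithNonEnglishLetters(
  "The na\u00EFve caf\u00E9 owner said \u0391\u03BB\u03C6\u03B1 twice"
);
console.log(found);
```

This sketch treats combining marks as word breaks, so text in decomposed form would need normalizing first (for example with String.prototype.normalize("NFC")).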

sth 2010-01-17 15:59:11

Answer 4

+5 A:

The situation with regexes, Unicode, and Javascript sucks. It's ridiculous that programmers should have to rely on external libraries to recognize that "Αλφα" is a word, or even that "é" is a letter.

But so it goes.

This guy has written a good library for handling Unicode in Javascript Regexes:

http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode

The Unicode stuff is a plugin to this regex library:

http://stevenlevithan.com/regex/xregexp/

http://stevenlevithan.com/regex/xregexp/xregexp.js

Here's a post about the Unicode extension:

http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin

And the extension page itself:

http://stevenlevithan.com/regex/xregexp/xregexp-unicode.js

Great work but it still bums me out that Javascript is so backwards in this regard.

(He wrote a book for O'Reilly about the topic so it's quite possible that he knows what he's talking about.)