What is the easiest way to match non-English characters in a regex? I would like to match each word in an input string individually, but the language may not be English, so I need to match characters like ü, ö, ß, and ñ. This is in JavaScript/jQuery, so any solution needs to work there.

+2  A: 

All Unicode-enabled regex flavours should have a special character class like \w that matches any Unicode letter. Take a look at your specific flavour here.
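For JavaScript specifically, a quick check can show how \w behaves. The sketch below assumes an ES2018 or later engine, where the `u` flag enables the `\p{L}` Unicode letter property; on older engines that part will not work. The sample string is only an illustration.

```javascript
// Assumption: an ES2018+ engine (needed for \p{...} with the `u` flag).
// Sample text "Müller aß jalapeños", written with escapes to avoid encoding issues.
const sample = "M\u00FCller a\u00DF jalape\u00F1os";

// In JavaScript, \w is ASCII-only: it is equivalent to [A-Za-z0-9_],
// so words are split at every non-ASCII letter.
const asciiWords = sample.match(/\w+/g);

// \p{L} with the `u` flag matches any Unicode letter.
const unicodeWords = sample.match(/\p{L}+/gu);

console.log(asciiWords);   // [ 'M', 'ller', 'a', 'jalape', 'os' ]
console.log(unicodeWords); // [ 'Müller', 'aß', 'jalapeños' ]
```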

OregonGhost
This is correct for most flavors of regex, but not for JavaScript, at least according to http://www.regular-expressions.info/javascript.html
Paul Wicks
Bad luck then, I guess. At least you can then use the Unicode charts posted by olle to find your characters ;)
OregonGhost
I think \w depends on the cultural settings on the client.
troelskn
I don't know, but in .NET you can always specify the culture you want. Apart from that, what is and isn't a letter is defined in the Unicode standard and does not depend on culture.
OregonGhost
+1  A: 

You do it the same way as any other character matching, but you use \uXXXX, where XXXX is the Unicode code point of the character in hexadecimal.
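As a rough sketch of this approach, the four letters from the question can be listed by code point alongside the ASCII letters. The `findWords` helper and the sample sentence are made up for illustration, and the character class would have to be extended by hand for any other letters (including the uppercase forms Ü, Ö, Ñ).

```javascript
// Letters from the question, by code point:
//   ü = U+00FC, ö = U+00F6, ß = U+00DF, ñ = U+00F1
// Assumption: only these four non-ASCII letters need to be recognised.
const wordPattern = /[A-Za-z\u00FC\u00F6\u00DF\u00F1]+/g;

// Hypothetical helper: returns every run of allowed letters in the text.
function findWords(text) {
  return text.match(wordPattern) || [];
}

// "Grüße, señor König!"
console.log(findWords("Gr\u00FC\u00DFe, se\u00F1or K\u00F6nig!"));
// [ 'Grüße', 'señor', 'König' ]
```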

Look at: http://unicode.org/charts/charindex.html

http://unicode.org/charts/

http://www.decodeunicode.org/

olle
+12  A: 

This should do it:

[^\x00-\x80]+

It matches characters whose code points are greater than 128 (0x80), which lies beyond the ASCII range. You can do the same thing with Unicode escapes:

[^\u0000-\u0080]+
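To match whole words rather than only the non-ASCII runs, one option is to combine this negated range with the ASCII letters. This is a sketch, not a complete solution: the `matchWords` name is hypothetical, and the pattern treats every character above U+0080 as part of a word, so non-ASCII punctuation such as ¿ is matched too.

```javascript
// A word here is a run of ASCII letters and/or any character above U+0080.
// Caveat: the negated range also accepts non-letters such as "¿" (U+00BF).
const wordRegex = /(?:[A-Za-z]|[^\u0000-\u0080])+/g;

// Hypothetical helper: returns all "words" found in the text.
function matchWords(text) {
  return text.match(wordRegex) || [];
}

// "Αλφα und Größe"
console.log(matchWords("\u0391\u03BB\u03C6\u03B1 und Gr\u00F6\u00DFe"));
// [ 'Αλφα', 'und', 'Größe' ]

// "¿Qué?" shows the over-matching: the inverted question mark is kept.
console.log(matchWords("\u00BFQu\u00E9?"));
// [ '¿Qué' ]
```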
yjerem
What about characters outside that range? There are more than 0x80 letters.
OregonGhost
The circumflex negates the character class.
troelskn
Then it's still wrong, in that it matches only "non-English" letters (and maybe other characters, but I don't know the full Unicode tables). The question author needs to match all letters, even though the question starts out asking only about the non-English characters. You know, because then it's complete.
OregonGhost
This doesn't answer the question, which is about matching words with non-English characters...
sth
+5  A: 

The situation with regexes, Unicode, and JavaScript sucks. It's ridiculous that programmers should have to rely on external libraries to recognize that "Αλφα" is a word, or even that "é" is a letter.

But so it goes.

This guy has written a good library for handling Unicode in JavaScript regexes:

http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode

The Unicode stuff is a plugin to this regex library:

http://stevenlevithan.com/regex/xregexp/

http://stevenlevithan.com/regex/xregexp/xregexp.js

Here's a post about the Unicode extension:

http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin

And the extension page itself:

http://stevenlevithan.com/regex/xregexp/xregexp-unicode.js

Great work, but it still bums me out that JavaScript is so backwards in this regard.

(He wrote a book for O'Reilly about the topic so it's quite possible that he knows what he's talking about.)

pat