What is the easiest way to match non-English characters in a regex? I would like to match each word in an input string individually, but the language may not be English, so I will need to match things like ü, ö, ß, and ñ. Also, this is in JavaScript/jQuery, so any solution will need to work there.
All Unicode-enabled regex flavours should have a special character class, like \w, that matches any Unicode letter. Take a look at your specific flavour here. (Be aware that in JavaScript, \w itself only covers ASCII: [A-Za-z0-9_].)
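For JavaScript specifically, a minimal sketch of such a "any Unicode letter" class is the \p{L} property escape. This assumes an ES2018 or newer runtime, where \p{...} is available together with the u flag; older browsers will reject the pattern. The helper name unicodeWords is made up for illustration.

```javascript
// Assumption: ES2018+ runtime. \p{L} ("any letter") only works with the "u" flag.
const unicodeWord = /\p{L}+/gu;

// Hypothetical helper name, for illustration only.
function unicodeWords(text) {
  // NFC folds "e" + combining accent into a single precomposed letter where one
  // exists, so accented letters are not split off from the rest of the word.
  return text.normalize("NFC").match(unicodeWord) || [];
}

console.log(unicodeWords("Αλφα, señor, Straße!"));
```

Digits, apostrophes, and hyphens are not letters, so a word like "don't" comes back as two matches; widen the class if that matters for your input.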
You do it the same way as any other character matching, but you use \uXXXX, where XXXX is the four-digit hexadecimal Unicode code point of the character.
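As a rough sketch of that approach, the four characters from the question can be added to an ordinary letter class via their escapes. The character list and the helper name matchWords are only illustrative; a real application would need to list every letter it expects.

```javascript
// \u00FC = ü, \u00F6 = ö, \u00DF = ß, \u00F1 = ñ (the examples from the question).
// Assumption: only these four non-ASCII letters need to be recognised.
const escapedWord = /[A-Za-z\u00FC\u00F6\u00DF\u00F1]+/g;

// Hypothetical helper name, for illustration only.
function matchWords(text) {
  // String.prototype.match returns null when nothing matches, so fall back to [].
  return text.match(escapedWord) || [];
}

console.log(matchWords("Die Straße über señor"));
// → [ 'Die', 'Straße', 'über', 'señor' ]
```

This works in old JavaScript engines too, but any letter missing from the class (for example "é") splits the word at that point.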
This should do it:
[^\x00-\x80]+
It matches runs of characters whose codes are greater than 128 (0x80), which puts them outside the ASCII range. You can do the same thing with Unicode escapes:
[^\u0000-\u0080]+
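A small sketch of what this pattern does in practice: on its own it returns only the non-ASCII runs, not whole words. The second pattern below is my own assumption about how one might extend the idea to words, by treating ASCII letters plus everything above \u0080 as word characters; it is deliberately crude and also accepts non-ASCII punctuation. Both helper names are hypothetical.

```javascript
// The pattern from above: runs of characters with codes above 0x80.
const nonAscii = /[^\u0000-\u0080]+/g;

// Assumed extension (not from the answer): ASCII letters plus anything above 0x80.
const roughWord = /[A-Za-z\u0081-\uFFFF]+/g;

// Hypothetical helper names, for illustration only.
function nonAsciiRuns(text) {
  return text.match(nonAscii) || [];
}

function roughWords(text) {
  return text.match(roughWord) || [];
}

console.log(nonAsciiRuns("Straße über señor")); // only the non-ASCII characters
console.log(roughWords("Αλφα ist ein Wort"));   // whole words, Greek included
console.log(roughWords("¿Qué?"));               // "¿" is above 0x80, so it sticks to the word
```

The last call shows the limitation: a code range cannot tell letters from symbols, so characters such as "¿" or "€" are treated as part of a word.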
The situation with regexes, Unicode, and JavaScript sucks. It's ridiculous that programmers should have to rely on external libraries to recognize that "Αλφα" is a word, or even that "é" is a letter.
But so it goes.
Steven Levithan has written a good library for handling Unicode in JavaScript regexes:
http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode
The Unicode stuff is a plugin to this regex library:
http://stevenlevithan.com/regex/xregexp/
http://stevenlevithan.com/regex/xregexp/xregexp.js
Here's a post about the Unicode extension:
http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin
And the extension page itself:
http://stevenlevithan.com/regex/xregexp/xregexp-unicode.js
Great work, but it still bums me out that JavaScript is so backwards in this regard.
(He wrote a book for O'Reilly about the topic so it's quite possible that he knows what he's talking about.)