I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that

A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". http://www.weask.us/entry/as3-regexp-match-words-boundry-type-characters

And since

\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]) http://www.javascriptkit.com/javatutors/redev2.shtml

obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is considered a word boundary, then al is a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a character, I don't exactly know how to go about finding it.

Any help?

Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:

var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
A: 

Have you set JavaScript to use non-ASCII? Here is a page that suggests setting JavaScript to use UTF-8: http://blogs.sun.com/shankar/entry/how_to_handle_utf_8

It says:

add a charset attribute (charset="utf-8") to your script tags in the parent page:

<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
Beel
@Beel: That didn't change anything...
Shawn
Yeah, the type attribute isn't even in HTML5 as it isn't supported by browsers, it's a mistake people made when interpreting the spec. The charset meta tag works, but charset in links isn't a real thing.
Rich Bradshaw
@Rich Bradshaw: I do have <meta http-equiv="content-type" content="text/html; charset=utf-8" /> in my head section. Is that what you mean?
Shawn
That's wrong too. The speech marks people added for XHTML should define two attributes: content and charset, but popular wisdom put them in the same speech marks with a semicolon for some reason! Browsers do parse that and make it work though. Check the HTML5 version of this for the best/conforming way to do it. Charset on js and CSS has never worked though and is pointless to add.
Rich Bradshaw
+1  A: 

While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b. If you want them to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
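
To make the limitation concrete, here is a small sketch. The first pattern reproduces the problem from the question. The second is an alternative that is not part of this answer: it assumes a newer engine that supports the u flag, Unicode property escapes (\p{L}, \p{M}) and lookbehind, and it emulates the word boundary with lookarounds instead of \b. The helper name findTwoLetterWord is made up for illustration.

```javascript
// \b only knows ASCII word characters, so the "é" in "Montréal" counts as
// a non-word character and "al" is found as a two-letter word.
var asciiBoundary = /\b([a-z]{2})\b/i;
var asciiMatch = asciiBoundary.exec("Montr\u00e9al"); // asciiMatch[1] is "al"

// Assumption: the engine supports the `u` flag, \p{...} and lookbehind.
// A "boundary" here means: no letter or combining mark on that side.
var unicodeBoundary = /(?<![\p{L}\p{M}])(\p{L}{2})(?![\p{L}\p{M}])/u;

// Hypothetical helper: returns the first two-letter word, or "" if none.
function findTwoLetterWord(text) {
  var m = unicodeBoundary.exec(text);
  return m ? m[1] : "";
}
```

Including \p{M} in the lookarounds also covers input where é arrives as e followed by a combining accent rather than as a single precomposed character.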

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:

"\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:

"\\b[a-z]{2}\\b"
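
As a rough sketch, the simplified pattern could slot into the original lookup like this. The function name extractState is hypothetical, and the m flag is dropped on the assumption that it is not needed, since the pattern contains no ^ or $ anchors. Without a capturing group, the matched word is at index 0 of the result.

```javascript
// Simplified pattern written as a regex literal: no comma, no group.
var re_state = /\b[a-z]{2}\b/i;

// Hypothetical wrapper around the lookup from the question.
function extractState(userInput) {
  var match_state = re_state.exec(userInput);
  // The whole match is the two-letter word, so use index 0.
  return match_state ? match_state[0] : "";
}
```

This only tidies the pattern. It still uses the ASCII-only \b, so a word like Montréal is still split at the é.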

Finally, I would usually recommend that you use a regex literal instead of the RegExp constructor, but if you do switch to XRegExp, you'll have no choice but to use the constructor.

Alan Moore
@Alan Moore: What's the difference between using the literal and the constructor? The difference I found is that if I use the constructor, I can add the matches of previous regular expressions to my regexp... for example: var re_address = new RegExp(match_buildingNumber[0] + match_street[0] + match_city[0] + "?", "mi"); That kind of thing, which is, to my knowledge, impossible with a regexp literal...
Shawn
Okay, if you've got a good reason for using the constructor, by all means use it. I just wanted to make sure you were aware of the regex-literal option.
Alan Moore
@Alan Moore: OK, thanks! But I'm still a bit curious. What IS the difference between the two? Why should one prefer using the literal when possible? Also, I downloaded XRegExp and the Unicode plugin, but I still don't see how to use it for what I want. I guess there would be an Lm (modifier letter) somewhere in there?
Shawn
Or maybe the Mn category?
Shawn
It's just that, with the constructor you're writing the regex in the form of a string literal, which has its own set of escaping rules. For example, if you forgot to escape the backslashes in your regex, you'd be looking for a word surrounded by backspaces, not a word surrounded by word boundaries.
Alan Moore
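
The escaping pitfall described in the last comment can be seen in a short sketch. The variable names and the sample string "in QC" are made up for illustration.

```javascript
// In a string literal, "\b" is the backspace character (U+0008), so this
// pattern looks for two letters surrounded by backspaces.
var unescaped = new RegExp("\b[a-z]{2}\b", "i");

// Doubling the backslashes passes a real \b (word boundary) to the engine.
var escaped = new RegExp("\\b[a-z]{2}\\b", "i");

// The regex literal needs no extra escaping for the backslashes.
var literal = /\b[a-z]{2}\b/i;

var unescapedFound = unescaped.test("in QC"); // false: no backspaces in the text
var escapedFound = escaped.test("in QC");     // true
var samePattern = escaped.source === literal.source; // true
```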