I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". http://www.weask.us/entry/as3-regexp-match-words-boundry-type-characters
And since
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]) http://www.javascriptkit.com/javatutors/redev2.shtml
obviously accented characters are not taken into account. This becomes a problem with words like Montréal
. If the é
is considered a word boundary, then al
is a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a characters, I don't exactly know how to go about finding it..
Any help?
Here is the relevant JavaScript code, which searches userInput
and finds two-letter words using the re_state
regular expression:
var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";