ansaurus

Question

How can I make a regular expression which takes accented characters into account?

Answer 1

A:

Have you set up your JavaScript to be read with a non-ASCII encoding? Here is a page that suggests serving JavaScript as UTF-8: http://blogs.sun.com/shankar/entry/how_to_handle_utf_8

It says:

Add a charset attribute (charset="utf-8") to your script tags in the parent page:
<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>

Beel 2010-09-12 05:10:14

@Beel: That didn't change anything...

Shawn 2010-09-12 17:12:47

Yeah, the type attribute isn't even in HTML5 as it isn't supported by browsers, it's a mistake people made when interpreting the spec. The charset meta tag works, but charset in links isn't a real thing.

Rich Bradshaw 2010-09-12 18:00:52

@Rich Bradshaw: I do have <meta http-equiv="content-type" content="text/html; charset=utf-8" /> in my head section. Is that what you mean?

Shawn 2010-09-12 18:15:02

That's wrong too. The speech marks people added for XHTML should define two attributes: content and charset, but popular wisdom put them in the same speechmarks with a semicolon for some reason! Browsers do parse that and make it work though. Check the HTML5 version of this for the best/conforming way to do it. Charset on js and CSS has never worked though and is pointless to add.

Rich Bradshaw 2010-09-13 06:37:11

Answer 2

+1 A:

While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b, which only know about the ASCII word characters [A-Za-z0-9_]. If you want them to work with anything beyond that, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
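A minimal sketch of that limitation, for anyone who wants to see it happen. The sample strings, the `LETTER` range and the `findTwoLetterWords` helper are assumptions made up for illustration; the hand-written range only covers Latin-1 and Latin Extended-A letters in precomposed form, and it is a stand-in workaround, not the XRegExp approach mentioned above.

```javascript
// \w is [A-Za-z0-9_] only, so an accented letter does not count as a word character.
const wordOnly = /^\w+$/;
const plainIsWord = wordOnly.test("cafe");          // true
const accentedIsWord = wordOnly.test("caf\u00E9");  // false ("café")

// \b is defined in terms of \w, so it reports a boundary between "r" and "è".
const asciiTwoLetters = /\b[a-z]{2}\b/;
const falsePositive = asciiTwoLetters.exec("tr\u00E8s"); // matches "tr" inside "très"

// Assumed workaround: spell out the letters explicitly (skipping U+00D7 and U+00F7,
// which are the multiplication and division signs) and emulate \b by hand.
const LETTER = "A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u017F";
const twoLetterWord = new RegExp(
  "(^|[^" + LETTER + "])" +   // start of input or a non-letter before the word
  "([" + LETTER + "]{2})" +   // the two-letter word itself (group 2)
  "(?![" + LETTER + "])",     // no letter directly after it
  "g"
);

// Hypothetical helper: collect every two-letter word in the text.
function findTwoLetterWords(text) {
  const words = [];
  let m;
  twoLetterWord.lastIndex = 0; // reset state kept by the "g" flag
  while ((m = twoLetterWord.exec(text)) !== null) {
    words.push(m[2]);
  }
  return words;
}

console.log(findTwoLetterWords("il a eu o\u00F9 tr\u00E8s")); // [ 'il', 'eu', 'où' ]
```

The leading separator is captured in group 1 because lookbehind may not be available; the trailing check uses a lookahead so that back-to-back two-letter words are still found.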

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:

"\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:

"\\b[a-z]{2}\\b"

Finally, I would usually recommend that you use a regex literal instead of the RegExp constructor, but if you do switch to XRegExp, you'll have no choice but to use the constructor.

Alan Moore 2010-09-12 07:27:22

@Alan Moore: What's the difference between using the literal and the constructor? The difference I found is that if I use the constructor, I can add the matches of previous regular expressions to my regexp... for example: var re_address = new RegExp(match_buildingNumber[0] + match_street[0] + match_city[0] + "?", "mi"); That kind of thing, which is, to my knowledge, impossible with a regexp literal...

Shawn 2010-09-12 17:22:59

Okay, if you've got a good reason for using the constructor, by all means use it. I just wanted to make sure you were aware of the regex-literal option.

Alan Moore 2010-09-12 17:41:30

@Alan Moore: ok thanks! But I'm still a bit curious... What IS the difference between the two? Why should one prefer using the literal when possible? Also, I downloaded XRegExp and the Unicode plugin, but I still don't see how to use it for what I want. I guess there would be an Lm (modifier letter) somewhere in there?

Shawn 2010-09-12 17:55:14

Or maybe the Mn category?

Shawn 2010-09-12 18:20:55

It's just that, with the constructor you're writing the regex in the form of a string literal, which has its own set of escaping rules. For example, if you forgot to escape the backslashes in your regex, you'd be looking for a word surrounded by backspaces, not a word surrounded by word boundaries.

Alan Moore 2010-09-13 05:15:02
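To make the escaping point in the comment above concrete, here is a small sketch. The variable names and the test string "to" are assumptions chosen for illustration only.

```javascript
// In a string literal, "\b" is the backspace character (U+0008), not a regex token.
const backspaceCode = "\b".charCodeAt(0); // 8

// Forgetting to double the backslashes: the regex looks for
// backspace + two letters + backspace.
const unescaped = new RegExp("\b[a-z]{2}\b");

// Doubling the backslashes passes a real \b (word boundary) to the regex engine.
const escaped = new RegExp("\\b[a-z]{2}\\b");

// The literal form needs no extra escaping and produces the same pattern.
const literal = /\b[a-z]{2}\b/;

console.log(unescaped.test("to"));              // false
console.log(unescaped.test("\bto\b"));          // true, only with literal backspaces
console.log(escaped.test("to"));                // true
console.log(escaped.source === literal.source); // true
```

With the constructor, every backslash meant for the regex engine has to survive string parsing first, which is why the literal form is less error-prone when the pattern is fixed.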
