We use the excellent validator plugin for jQuery here on Stack Overflow to do client-side validation of input before it is submitted to the server.

It generally works well, however, this one has us scratching our heads.

The following validator method is used on the ask/answer form for the user name field (note that you must be logged out to see this field on the live site; it's on every /question page and the /ask page):

$.validator.addMethod("validUserName",
  function(value, element) {
    return this.optional(element) ||
      /^[\w\-\s\dÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test(value);
  },
  "Can only contain A-Z, 0-9, spaces, and hyphens.");

Now this regex looks weird but it's pretty simple:

  • match the beginning of the string (^)
  • match any of these..
    • word character (\w)
    • dash (-)
    • space (\s)
    • digit (\d)
    • crazy moon language characters (àèìòù etc)
  • now match the end of the string ($)

Yes, we ran into the Internationalized Regular Expressions problem. JavaScript's definition of "word character" does not include international characters.. at all.
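The claim about "word character" can be checked with a minimal sketch. The sample names and the Latin-1 range below are illustrative assumptions, not the site's actual code, and the non-ASCII letters are written as \u escapes so the snippet does not depend on the file's encoding.

```javascript
// \w is defined as [A-Za-z0-9_]; accented letters are not "word characters".
const wordOnly = /^\w+$/;

const asciiName = wordOnly.test("Bill");           // plain ASCII letters
const accentedName = wordOnly.test("h\u00D3ra");   // contains Ó (U+00D3)

// Illustrative alternative: add the Latin-1 block U+00C0..U+00FF by escape.
// Note this range also lets through the signs U+00D7 and U+00F7.
const withLatin1 = /^[\w\u00C0-\u00FF]+$/;
const accentedOk = withLatin1.test("h\u00D3ra");

console.log(asciiName, accentedName, accentedOk); // true false true
```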

Here's the weird part: even though we've gone to the trouble of manually adding tons of the valid international characters to the regex, it doesn't work. You cannot enter these international characters in the input box for user name without getting the..

Can only contain A-Z, 0-9, spaces, and hyphens

.. validation return!

Obviously the validation is working for the other parts of the regex.. so.. what gives?

The other strange part is that this validation works in the browser's JavaScript console but not when executed as a part of our standard *.js includes.

/^[\w-\sÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/ .test('ÓBill de hÓra') = true

We've run into some really bizarre international character issues in JavaScript code before, resulting in some very, very nasty hacks. We'd like to understand what's going on here and why. Please enlighten us!

+1  A: 

The international characters listed are part of extended ASCII. The ones added by you are certainly not.

dusoft
+1  A: 

Seeing as the statement works in the console, could this have to do with the way your .js files are saved (i.e. ASCII or UTF-8), and that the browser is loading them thusly and in the process translates the characters?

Colin
JS doesn't know anything about UTF-8, even if the encoding is set so.
dusoft
But the browser does, doesn't it? What if the file is loaded as UTF-8 and the JS engine of the browser interprets the characters wrongly because the browser loaded the file incorrectly ?
Colin
Yep, the browser cares. If you save an "Ä" as not-Unicode, it will result in an invalid UTF-8 byte stream. Therefore, it can never match a UTF-8 byte stream corresponding to "Ä".
Boldewyn
s/browser cares/browser and hence the JS engine cares/
Boldewyn
+2  A: 

What is the character encoding of the JS file?

For XML QNames I use this RegExp:

/**
 * Definition of an XML Name
 */
var NameStartChar = "A-Za-z:_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"+
                    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"+
                    "\uF900-\uFDCF\uFDF0-\uFFFD\u010000-\u0EFFFF";
var NameChar = NameStartChar+"\\-\\.0-9\u00B7\u0300-\u036F\u203F-\u2040";
var Name = "^["+NameStartChar+"]["+NameChar+"]*$";
new RegExp(Name).test(value);

It works like a charm also with internationalized characters. Note the escaping. Due to that I'm able to restrict the JS file to ASCII characters only. Therefore I don't get into trouble when dealing with ISO-8859 vs UTF-8 charsets.

This no longer holds if you use a character encoding of which ASCII is not a real subset (e.g., UTF-16, as used in Asia).
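To get a character list into ASCII-only form, something like the following could be used. This is a rough sketch: `toAsciiEscapes` is a hypothetical helper name, not part of any library, and it works per UTF-16 code unit, so characters outside the Basic Multilingual Plane come out as two escapes.

```javascript
// Hypothetical helper: replace every code unit outside printable ASCII
// with its \uXXXX escape, so the result can be pasted into an ASCII-only file.
function toAsciiEscapes(text) {
  return text.replace(/[^\x20-\x7E]/g, function (ch) {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4); // left-pad to four hex digits
  });
}

// À È ç followed by two ASCII characters that are left untouched.
const escaped = toAsciiEscapes("\u00C0\u00C8\u00E7-z");
console.log(escaped); // \u00C0\u00C8\u00E7-z
```

The escaped text means the same thing inside a string or regex literal whatever encoding the file is saved or served in.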

Cheers,

Boldewyn
As I understood, the validator rules are in an external JS file. Then I bet on that file being in the wrong encoding (i.e., not UTF-8).
Boldewyn
I am opening the file on disk in Notepad2 and it looks correct -- identical to what you see above in ANSI and when I switch to Unicode, UTF-8 encodings, also identical.
Jeff Atwood
That can't be. An ANSI 'Ä' (==ISO-8859-1) has a single-byte representation 'C4', while a UTF-8 'Ä' looks like 'C3 84' in a hex editor. What do you mean by 'switch'? Is it a real conversion between encodings?
Boldewyn
well, I'm opening the .js file from the server itself in Notepad2 and switching file encodings via the drop-down menu. I can't see any differences in any of them for the regex string. It is entirely possible I'm doing something wrong..
Jeff Atwood
weirdly, this matches true on a string containing a "<". Seemingly because of the last bit of the NameStartChar "\u010000-\u0EFFFF", even though < is \u003C and not in that range. Similarly @, ?, =, and other characters between '9' and 'A'. thoughts on why?
larson4
@larson4: Hm, it can be that your JS engine cuts off after the first 4 digits. But then, `\u0100` still doesn't contain the `<`. Strange, indeed.
Boldewyn
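A possible explanation for the "<" match, offered as a sketch rather than a confirmed diagnosis: a `\u` escape in a JavaScript string takes exactly four hex digits, so "\u010000-\u0EFFFF" is read as \u0100, "0", the range "0"-\u0EFF, "F", "F". That range starts at U+0030 and so covers "<", "@", "?" and "=". The second pattern assumes an engine that supports the ES2015 `u` flag and `\u{...}` escapes.

```javascript
// "\u010000" is "\u0100" followed by the two literal characters "0" and "0".
const tail = "\u010000-\u0EFFFF";
const tailLength = tail.length; // 7 code units, not 3

// Inside a character class this becomes: U+0100, "0", the range 0-\u0EFF, "F", "F".
const loose = new RegExp("^[" + tail + "]$");
const looseMatchesLt = loose.test("<"); // "<" is U+003C, inside 0-\u0EFF

// Assumption: ES2015 "u" flag available, so code points above U+FFFD can be written.
const astral = new RegExp("^[\\u{10000}-\\u{EFFFF}]$", "u");
const astralMatchesLt = astral.test("<");
const astralMatchesHigh = astral.test("\u{10000}");

console.log(tailLength, looseMatchesLt, astralMatchesLt, astralMatchesHigh);
// 7 true false true
```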
I have created a javascript library to do some of this stuff, not sure how correct or optimal it is, but take a look: http://code.google.com/p/charfunk/
larson4
+9  A: 

I think the email and url validation methods are a good reference here, eg. the email method:

email: function(value, element) {
 return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
},

The script to compile that regex. Or the one for URLs.

In other words, replacing your arbitrary list of "crazy moon" characters with this could help:

[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]

Basically this avoids the character encoding issues you have elsewhere by replacing the needs-encoding characters with more general definitions. While not necessarily more readable, it's at least shorter than your full list.
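Applied to the user name rule from the question, the substitution might look like the sketch below. `isValidUserName` is a hypothetical standalone function so the snippet runs without jQuery; in the plugin the same regex would go into the `addMethod` callback. The range is far broader than the original hand-typed list (it admits non-Latin scripts and some punctuation), so treat it as a starting point.

```javascript
// ASCII-only source: \w covers A-Z, a-z, 0-9 and _, and the escaped ranges
// replace the literal accented letters from the original pattern.
const userNamePattern = /^[\w\-\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+$/;

// Hypothetical wrapper, standing in for the body of the validator method.
function isValidUserName(value) {
  return userNamePattern.test(value);
}

const accented = isValidUserName("\u00D3Bill de h\u00D3ra"); // "ÓBill de hÓra"
const withMarkup = isValidUserName("Bill<b>");

console.log(accented, withMarkup); // true false
```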

Jörn Zaefferer
+9  A: 

This isn't really an answer but I don't have 50 rep yet to add a comment... It can definitely be attributed to encoding issues.

Yea "ECMA shouldn't care about encoding..." blah blah, well if you're on Firefox, go to View > Character Encoding > Western (ISO-8859-1) then try using the Name field.

It works fine for me after changing the encoding manually (granted the rest of the page doesn't like the encoding switch, :P)

(on IE8 you can go to Page > Encoding > Western European (Windows) to get the same effect)

scott
he's right, this magically makes the Name: validation work (!)
Jeff Atwood
also this is surely an answer, and a good one, and you have >50 rep now :)
Jeff Atwood
+1  A: 

Use something like Fiddler or Charles (not Firebug's Net panel, or anything else that's actually inside the browser) to examine what's actually coming over the wire. It's almost certainly an encoding issue: either the file has been saved in some Microsoft character set and is being sent as UTF-8, or maybe the other way around.
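What such a mismatch does to a regex can be simulated in a few lines. This is only a sketch, assuming a runtime with the standard `TextEncoder` and `TextDecoder` globals; the single character "Ä" stands in for the whole character list, and the byte-wise `String.fromCharCode` call plays the role of an ISO-8859-1 decoder.

```javascript
// "Ä" (U+00C4) is the two bytes C3 84 in UTF-8 but the single byte C4 in ISO-8859-1.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00C4"));

// Case 1: file saved as UTF-8 but read as ISO-8859-1 (one character per byte).
const readAsLatin1 = String.fromCharCode(...utf8Bytes); // "\u00C3\u0084"

// Case 2: file saved as ISO-8859-1 but read as UTF-8. A lone C4 byte is
// invalid UTF-8, so the decoder substitutes U+FFFD.
const readAsUtf8 = new TextDecoder("utf-8").decode(new Uint8Array([0xC4]));

// Character classes built from the misread source no longer contain "Ä".
const intended = new RegExp("^[\u00C4]+$").test("\u00C4");
const brokenLatin1 = new RegExp("^[" + readAsLatin1 + "]+$").test("\u00C4");
const brokenUtf8 = new RegExp("^[" + readAsUtf8 + "]+$").test("\u00C4");

console.log(utf8Bytes, intended, brokenLatin1, brokenUtf8);
```

Either direction leaves the pattern syntactically valid, so nothing fails loudly; the class simply stops matching the characters it was written for.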

In the case of JS RegExps you can, as Boldewyn points out, avoid these problems by specifying the Unicode code point for the characters you want that are outside the US-ASCII range. It would still be as well to make sure you aren't mixing up encodings between the place where the file is saved and the place where it's served, though.

NickFitz
gzip over the wire, so awkward to do
Jeff Atwood
Both Fiddler and Charles can deal with that. IIRC Fiddler (at least in version 2) will offer you a button in the Response viewing area to allow you to view the ungzipped content.
NickFitz