I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like Firefox's default character encoding for URLs is NOT UTF-8, unlike most other browsers, which do appear to default to UTF-8. Additionally, it looks like Firefox tries to make an intelligent decision about which character encoding to use, based on the content of the URL.

For example, if you enter a URL with a 'q' query string parameter directly into the address bar (I'm using Firefox 3.5.5), this is how the parameter is actually encoded in the HTTP request:
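1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded; strictly speaking, byte 0x9E is 'ž' in windows-1252, the superset that browsers generally use in place of iso-8859-1)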
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because the first part of the value is the same as in 1, which was iso-8859-1 encoded).
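
To double-check which encodings produce the byte sequences above, here is a small Java sketch. The class name QueryEncodingDemo and the helper encode are just made up for illustration, and it assumes the JDK ships the windows-1252 charset (standard desktop JDKs do, as far as I know). It only reproduces the escaped values; it says nothing about how Firefox decides internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class QueryEncodingDemo {

    // Converts the value to bytes in the named charset, then percent-encodes them.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String latin = "Knji\u017Eevni"; // Književni
        String han = "\u6F22\u5B57";     // 漢字

        System.out.println(encode(latin, "windows-1252")); // Knji%9Eevni
        System.out.println(encode(latin, "UTF-8"));        // Knji%C5%BEevni
        System.out.println(encode(latin + han, "UTF-8"));  // Knji%C5%BEevni%E6%BC%A2%E5%AD%97
    }
}
```

The first line matches case 1 and the last line matches case 3, which is consistent with a single-byte encoding being used only when every character fits into it.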

So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.

However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is encoded as iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we always expect a UTF-8 encoded URL ... so when we receive this parameter our application does not know how to interpret it, and it can cause some strange results.

As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:

http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni

So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?

+2  A: 

Looks like it is using latin-1 unless some character can't be represented in that encoding, in which case it uses UTF-8 for the whole value.

If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).

Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
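
Since the question mentions Java, here is a minimal sketch of that "validate as UTF-8, else fall back" idea. The class name LenientQueryDecoder and its methods are hypothetical, the fallback charset is assumed to be windows-1252 (which is what byte 0x9E for 'ž' corresponds to), and the raw value is assumed to be ASCII with %XX escapes as it arrives on the wire. Note that convenience decoders such as new String(bytes, charset) silently replace malformed input instead of failing, so the sketch uses a CharsetDecoder set to REPORT errors.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class LenientQueryDecoder {

    // Assumption: single-byte fallback used when the bytes are not valid UTF-8.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    // Turns %XX escapes and '+' into raw bytes without committing to a charset.
    static byte[] percentDecodeToBytes(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()
                    && Character.digit(raw.charAt(i + 1), 16) >= 0
                    && Character.digit(raw.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' '); // form encoding uses '+' for a space
            } else if (c < 0x80) {
                out.write(c);   // plain ASCII, including a stray '%'
            } else {
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Strict UTF-8 first; if the bytes are malformed, reinterpret them with the fallback.
    static String decode(String raw) {
        byte[] bytes = percentDecodeToBytes(raw);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, FALLBACK);
        }
    }
}
```

With this, both decode("Knji%C5%BEevni") and decode("Knji%9Eevni") should come back as "Književni". The caveat from above still applies: a latin-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8.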

Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.

thomasrutter
Yeah, it's odd behavior. IE(8) and Chrome both appear to always encode the same URL I used above as UTF-8 ... so I'm guessing that UTF-8 is actually their default encoding. But yeah, I was hoping there would be an easier fix, and it looks like that might not be the case. Now the hard part will be to figure out exactly how to validate UTF-8 encoding (in Java) :/ Thank you for your help!
JasonStoltz
It looks like there are actually a few reported bugs in Bugzilla representing this issue. https://bugzilla.mozilla.org/show_bug.cgi?id=461304, https://bugzilla.mozilla.org/show_bug.cgi?id=451359
JasonStoltz
I don't know much about Java, but according to Wikipedia, InputStreamReader and OutputStreamWriter classes support native UTF-8. You tell it to interpret as UTF-8 in the constructor, and then presumably if you get an exception, you catch it (and try another encoding).
thomasrutter
@thomasrutter Interesting idea, sounds like it should work. Perhaps I will give that a try. Thank you.
JasonStoltz
@thomasrutter Actually, I'm thinking I may be able to do something similar with URLDecoder. That may be more appropriate.
JasonStoltz
A: 

There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).

The part that you care about comes (usually) from a form, and the encoding of the source page determines the encoding used (before the % escaping). The form element in HTML can also take an encoding attribute (accept-charset) which overrides the page setting.

So it is not the fault of Firefox, the encoding of the referrer page/form is the determining factor. And that is the standard behavior.

Mihai Nita
The thing is, when I submit it as a form, it IS encoded in UTF-8, which is happening as you state. My issue is that when the response comes back and the page is rendered, the query string parameter actually appears in its un-encoded state in the address bar ... "q=Književni". When I then explicitly press enter on the address bar with that URL entered, it looks like Firefox doesn't relate that URL to the current page (and hence to the source encoding of that page), so it attempts to use the iso-8859-1 encoding.
JasonStoltz