How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like the default Firefox character encoding for URLs is NOT UTF-8, unlike most other browsers, which do default to UTF-8. Additionally, it looks like Firefox tries to make some intelligent decisions about which character encoding to use, based on the content of the URL.

For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:

For the given query string parameter, this is how it's actually encoded in the HTTP request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded; strictly speaking, byte 0x9E is ž in windows-1252, the superset that browsers use when they say iso-8859-1)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because the first part of the value is the same as in 1, which was iso-8859-1 encoded).
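
To make the difference concrete, here is a minimal Java sketch that reproduces the byte sequences above with `URLEncoder`. It only illustrates the two encodings; it is not Firefox's actual logic. The class name `UrlEncodingDemo` and the helper `encode` are made up for this example, and I'm assuming the single-byte encoding in play is windows-1252.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodingDemo {

    // Percent-encodes a value using the named charset.
    // Hypothetical helper: wraps the checked exception so callers stay simple.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String latin = "Knji\u017Eevni"; // "Književni", written with an escape for the z-caron
        String kanji = "\u6F22\u5B57";   // the two CJK characters from example 2

        System.out.println(encode(latin, "UTF-8"));         // Knji%C5%BEevni
        System.out.println(encode(latin, "windows-1252"));  // Knji%9Eevni
        System.out.println(encode(kanji, "UTF-8"));         // %E6%BC%A2%E5%AD%97
        System.out.println(encode(latin + kanji, "UTF-8")); // Knji%C5%BEevni%E6%BC%A2%E5%AD%97
    }
}
```

The CJK characters have no representation in windows-1252 at all, which would be consistent with Firefox falling back to UTF-8 in examples 2 and 3, though that is only a guess from the observed output.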

So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.

However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is now encoded as iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we are always expecting a UTF-8 encoded URL ... so when we receive this parameter, our application does not know how to interpret it, and that can cause some strange results.

As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:

http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni

So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?

Yeah, it's odd behavior. IE(8) and Chrome both appear to always encode the same URL I used above as UTF-8 ... so I'm guessing that UTF-8 is actually their default encoding. But yeah, I was hoping there would be an easier fix, and it looks like that might not be the case. Now the hard part will be to figure out exactly how to validate UTF-8 encoding (in Java) :/ Thank you for your help!

JasonStoltz 2009-11-20 13:51:12

It looks like there are actually a few reported bugs in Bugzilla representing this issue. https://bugzilla.mozilla.org/show_bug.cgi?id=461304, https://bugzilla.mozilla.org/show_bug.cgi?id=451359

JasonStoltz 2009-11-20 14:16:17

I don't know much about Java, but according to Wikipedia, InputStreamReader and OutputStreamWriter classes support native UTF-8. You tell it to interpret as UTF-8 in the constructor, and then presumably if you get an exception, you catch it (and try another encoding).

thomasrutter 2009-11-21 06:52:59
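
A note on the exception idea above: as far as I know, `InputStreamReader` constructed with just a charset name silently replaces malformed bytes instead of throwing, so a strict check needs a `CharsetDecoder` configured to report errors. Below is a minimal sketch of such a check. The class name `StrictUtf8` and the method `isValidUtf8` are made up for illustration, and it assumes you already have the raw (percent-decoded) bytes in hand.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Form = {(byte) 0xC5, (byte) 0xBE}; // z-caron encoded as UTF-8
        byte[] singleByteForm = {(byte) 0x9E};        // z-caron as sent in q=Knji%9Eevni

        System.out.println(isValidUtf8(utf8Form));       // true
        System.out.println(isValidUtf8(singleByteForm)); // false: lone continuation byte
    }
}
```

When the check fails, the caller could retry with another encoding, which is the "try another encoding" step suggested above.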

@thomasrutter Interesting idea, sounds like it should work. Perhaps I will give that a try. Thank you.

JasonStoltz 2009-11-24 20:19:25

@thomasrutter Actually, I'm thinking I may be able to do something similar with URLDecoder. That may be more appropriate.

JasonStoltz 2009-11-24 20:21:59
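
For anyone trying the `URLDecoder` route: it does not throw on bytes that are invalid in the requested charset; it substitutes the replacement character U+FFFD. One possible approach, sketched below, is to decode as UTF-8 first and fall back to a single-byte encoding when U+FFFD shows up. This is only a guess at the kind of fallback a site like Google might use, not a description of what they actually do. The names `QueryParamDecoder` and `decodeParam` are hypothetical, the windows-1252 fallback is an assumption, and the input must be the raw, still percent-encoded value (for example, taken from the undecoded query string rather than from an already-decoded request parameter).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryParamDecoder {

    private static final char REPLACEMENT = '\uFFFD';

    // Decodes a raw percent-encoded value: UTF-8 first, windows-1252 as fallback.
    static String decodeParam(String rawValue) {
        String asUtf8 = decode(rawValue, "UTF-8");
        // URLDecoder marks undecodable bytes with U+FFFD instead of throwing.
        if (asUtf8.indexOf(REPLACEMENT) < 0) {
            return asUtf8;
        }
        return decode(rawValue, "windows-1252");
    }

    // Hypothetical helper: wraps the checked exception from URLDecoder.
    private static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String expected = "Knji\u017Eevni"; // "Književni"
        System.out.println(decodeParam("Knji%C5%BEevni").equals(expected)); // true
        System.out.println(decodeParam("Knji%9Eevni").equals(expected));    // true
    }
}
```

One known weakness of this sketch: a value that legitimately contains U+FFFD (sent as %EF%BF%BD) would wrongly trigger the fallback, so a byte-level strict UTF-8 check is the more robust option if that matters.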

The thing is, when I submit it as a form, it IS encoded in UTF-8, which is happening as you state. My issue is that when the response comes back and the page is rendered, the query string parameter actually appears in its un-encoded state in the address bar ... "q=Književni". When I then explicitly press enter on the address bar with that URL entered, it looks like Firefox doesn't relate that URL to the current page (and hence the source encoding of that page), so it attempts to use the iso-8859-1 encoding instead.

JasonStoltz 2009-11-20 13:43:46
