views: 1059
answers: 3
I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns, we have generated URL rewriting rules that mimic the layout of the application, which has led to many non-ASCII characters in the URLs.

Examples: š, ž, č

Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some are done by adding 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3, and intermittently the browsers display the non-Latin characters URL-encoded.

š = %c5%a1
ž = %c5%be
č = %c4%8d
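Those escape sequences are simply the UTF-8 bytes of each character, percent-encoded. A minimal sketch (Python used here purely to demonstrate the encoding, not the site's actual ASP/IIS stack):

```python
from urllib.parse import quote

# Each character is encoded to its UTF-8 bytes, and each byte
# becomes a %XX escape - exactly the output seen in the browsers.
for ch in "šžč":
    print(ch, "=", quote(ch))
# š = %C5%A1
# ž = %C5%BE
# č = %C4%8D
```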

I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...

Does anyone know of a way of forcing IIS not to URL-encode these characters, or is my best bet to replace them with non-diacritic equivalents?

Thanks

A: 

Those characters should be valid in a URL. I did the URL SEO work on a large travel site, and that's where I learned this. When you force diacritics to ASCII you can change the meaning of words if you're not careful. Often there is no direct translation, as diacritics only have meaning in their context.

Rimian
Hi, yeah, I'm aware they are valid URLs; I'm just trying to get consistent output for the end user.
Greg B
+4  A: 

Ask yourself if you really want them non-url encoded. What happens when a user that does not have support for those characters installed comes around? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...

Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before writing it into link output, and URL-decode it before using the input. But don't use ž and other local letters in URLs...
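The substitution Tomas describes (z for ž, etc.) can be sketched as below, assuming a simple decompose-and-drop-combining-marks approach. This handles š, ž and č, but note it is not universal: letters with no decomposition (e.g. ð or ß) pass through unchanged.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD decomposes each character (š -> s + combining caron);
    # we then drop the combining marks, keeping the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("život"))  # -> zivot
```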

As a side note, in Sweden we have å, ä and ö, but no one ever uses them in urls - we use a, a and o, because browsers won't support the urls otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the url. The text will still show correctly on the page, right? ;)

Tomas Lycken
Yes the copy will still display properly
Greg B
Then use plain ASCII letters - your Croatian and Slovenian customers will be able to read the URLs even without the little "upside-down roof" over the z in ž...
Tomas Lycken
Thanks Tomas. After speaking with the client, we've decided that removing the diacritics is the easiest and most reliable course of action.
Greg B
Incidentally, if you want to see Unicode URLs Done Right, take a look at Wikipedia.
bobince
+2  A: 

Does anyone know of a way of forcing IIS to not URL encode

You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different to if you'd just written ‘%C5%A1’ in the first place.

Including a raw ‘š’ in a link is not wrong as such; the browser is supposed to encode it to UTF-8 and URL-encode it as per the IRI spec. But to make sure this actually works you should ensure that the page containing the link is served as UTF-8. Again, manual URL-encoding is probably safest.
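One way to do the manual URL-encoding bobince recommends, sketched in Python (the path below is hypothetical; the same idea applies whatever generates the Location header or link href): percent-encode the path while leaving the separators alone.

```python
from urllib.parse import quote

# Hypothetical Croatian product path containing diacritics.
path = "/hr/proizvodi/š-ž-č"

# safe="/" keeps the path separators; everything non-ASCII is
# encoded to UTF-8 bytes and then percent-escaped.
encoded = quote(path, safe="/")
print(encoded)  # -> /hr/proizvodi/%C5%A1-%C5%BE-%C4%8D
```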

I've had no trouble with UTF-8 URLs, can you link to an example that is not working?

do you have a link to a reference where it details what comprises a valid HTTP header?

Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.

The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’, and TEXT, in most situations where it is included in HTTP, cannot be contrived to be an atom. In any case, RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue as to how one might interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.

In any case, no actual browser attempts to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...

bobince
Hi bobince, do you have a link to a reference detailing what comprises a valid HTTP header? Thanks
Greg B