I'm trying to URL-escape (percent-encode) non-ASCII characters in several URLs I'm dealing with. I'm working with a Flash application that loads resources like images and sound clips from these URLs. The filenames can contain non-ASCII characters, for example 日本語.jpg, so I escape them by UTF-8 encoding the characters and then percent-escaping the resulting bytes, to get the following:

%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg
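For readers who want to reproduce the escaping step, here is a minimal JavaScript sketch of it. The helper name `percentEncodeUtf8` is made up for illustration, and the choice to leave only the RFC 3986 unreserved characters unescaped is an assumption about what the asker's code does.

```javascript
// Hypothetical helper: percent-encode a filename via its UTF-8 bytes.
function percentEncodeUtf8(name) {
  const unreserved = /^[A-Za-z0-9\-._~]$/; // RFC 3986 unreserved characters
  const encoder = new TextEncoder();       // always encodes to UTF-8
  let out = "";
  for (const ch of name) {                 // iterate by code point
    if (unreserved.test(ch)) {
      out += ch;                           // safe ASCII passes through unchanged
    } else {
      // Convert the character to UTF-8 bytes, then escape each byte as %XX.
      for (const byte of encoder.encode(ch)) {
        out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
    }
  }
  return out;
}

console.log(percentEncodeUtf8("日本語.jpg"));
// prints "%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg"
```

For this filename the built-in `encodeURIComponent` gives the same result; the manual version only makes the two steps visible.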

These filenames work fine when I run the app in any browser other than Internet Explorer - I've tried Firefox, Safari and Chrome. But when I launch the app in IE (tried both 6 and 8) and it tries to load the sound clip, I get: Error #2044: Unhandled ioError, and the URL has been corrupted to something like:

日本語.jpg

Any thoughts on how to fix this? This is just test-driving the Flash app with local filesystem URLs. I've also noticed that Internet Explorer isn't able to locate a file such as file:///C:/%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg, though Chrome and Firefox will decode it and load just fine for a file with the path

C:\日本語.jpg

edit

I think my problem is the same as the one encountered in the following ActionScript code fragment:

import flash.display.Loader;
import flash.net.URLRequest;
...
var ldr:Loader;
var req:URLRequest = new URLRequest("日本語.jpg");
ldr = new Loader();
ldr.load(req);

Using the string 日本語.jpg will work in IE, while using the string %E6%97%A5%E6%9C%AC%E8%AA%9E.jpg works in other browsers. What I need is a single form that will work in all browsers. I have tried the %u encoding, and I have tried setting the HTTP request header to Content-Type: text/html; charset=utf-8, with no luck in either percent-escaped or unescaped form.

+1  A: 

Sorry, no solution, but maybe at least some more information about what might be going on here. (Probably you've already figured this much out, but maybe it will help another reader find a solution.) The "official" URL encoding specification seems to leave the door wide open as to how to decode escaped URLs like the ones you are generating--are the escaped entities intended to represent UTF-8 characters (as Firefox, etc. are interpreting them) or ASCII characters (as IE is interpreting them)? I don't know of any way to force the intended decoding strategy.
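To make the ambiguity concrete, here is a small JavaScript sketch that decodes the same escaped name two ways. The one-byte-per-character decoder is a hypothetical stand-in (Latin-1 style); what IE actually does internally is not established in this thread.

```javascript
// Hypothetical decoder: treat every %XX escape as one character (Latin-1 style).
function decodeAsSingleBytes(s) {
  return s.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

const escaped = "%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg";

// Interpretation 1: the escapes are UTF-8 bytes (3 bytes per character here).
const asUtf8 = decodeURIComponent(escaped);
console.log(asUtf8, asUtf8.length);   // prints "日本語.jpg 7"

// Interpretation 2: each escape is its own character, giving 9 garbled
// characters in place of the 3 intended ones.
const asBytes = decodeAsSingleBytes(escaped);
console.log(asBytes.length);          // prints 13
```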

Just a question: what bad thing happens if you do not escape them at all, but leave the Unicode in the URL? Although I don't have a lot of experience with it, I thought I remembered reading somewhere that the days of needing to escape Unicode in URLs are behind us. Could be wrong about that...

Dave
Most browsers seem ok with urls containing unicode characters. I'm building a Flex application, though, and my urls are links to external assets like sound clips, images, movies, etc. When I run the compiled .swf in the flash plug-in, these assets only load if unicode characters are url / percent escaped UTF-8. Otherwise they just fail to load. These percent-escaped filenames are working fine in every browser except Internet Explorer.
Bear
URI/URL (RFC 3986) requires encoding of non-ASCII characters. IRI (RFC 3987), on the other hand, allows most Unicode characters unencoded. IRI is the new standard that replaces the old URI/URL standard, but many systems do not implement IRI yet. The IRI specification does provide rules for converting an IRI to a URI/URL and vice versa.
Remy Lebeau - TeamB
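As a rough illustration of the IRI-to-URI direction mentioned in the comment above, JavaScript's built-in `encodeURI` percent-encodes non-ASCII characters as UTF-8 while leaving URL delimiters alone. The example.com address is a placeholder, and this sketch only covers the path and query; RFC 3987 treats host names separately, which is not shown here.

```javascript
// Placeholder IRI with non-ASCII characters in the path.
const iri = "http://example.com/images/日本語.jpg?size=large";

// encodeURI escapes the non-ASCII characters as UTF-8 bytes but keeps
// reserved delimiters such as ":", "/", "?" and "=" intact.
const uri = encodeURI(iri);
console.log(uri);
// prints "http://example.com/images/%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg?size=large"

// decodeURI reverses the mapping for this input.
console.log(decodeURI(uri) === iri); // prints true
```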
+1  A: 
JasonTrue
+1  A: 

Try encoding only the parts of the URI that would cause it to be parsed incorrectly. For instance, encode &, ?, and space. Leave everything else as is, and it should work like a charm.
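A minimal sketch of that idea in JavaScript, assuming the set of characters to escape is exactly the three named above (`&`, `?`, and space). The helper name `escapeDelimitersOnly` is made up, and a real filename might need more characters in the set, such as `%` or `#`.

```javascript
// Hypothetical helper: escape only characters that would break URL parsing,
// leaving non-ASCII characters untouched.
function escapeDelimitersOnly(name) {
  return name.replace(/[&? ]/g, (ch) =>
    "%" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0"));
}

console.log(escapeDelimitersOnly("日本語 a&b?.jpg"));
// prints "日本語%20a%26b%3F.jpg"
```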

If you are still running into problems, you may need to set the content type to UTF-8 in your HTTP headers. Something like Content-Type: text/html; charset=UTF-8.

Bear
Unfortunately, the framework I am working with - Flex - doesn't handle unescaped, non-ascii characters particularly well. I need to find if there is a proper way around this. I will dig around in the Flex framework to see if it is possible to access the HTTP headers, but I was hoping for a higher level solution.
Bear
+1  A: 

Why not just use Unicode escape sequences? Paste this into the body of an HTML web page to see what I mean:

   <script type="text/javascript">
      var fileName = "日本語.jpg";
      document.write(escape(fileName));
   </script>

I get %u65E5%u672C%u8A9E.jpg.
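For comparison, here is a sketch showing how the `%u` form differs from UTF-8 percent-encoding. `escape` is a legacy JavaScript function, and its `%uXXXX` output is not valid percent-encoding, so a standard decoder rejects it. The helper `decodesAsPercentEncoding` is a made-up name for illustration.

```javascript
// Hypothetical helper: report whether a string is valid UTF-8 percent-encoding.
function decodesAsPercentEncoding(s) {
  try {
    decodeURIComponent(s);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false; // malformed escape sequence
    throw e;
  }
}

const fileName = "日本語.jpg";
const legacy = escape(fileName);               // UTF-16 code units as %uXXXX
const standard = encodeURIComponent(fileName); // UTF-8 bytes as %XX

console.log(legacy);   // prints "%u65E5%u672C%u8A9E.jpg"
console.log(standard); // prints "%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg"
console.log(decodesAsPercentEncoding(legacy));   // prints false
console.log(decodesAsPercentEncoding(standard)); // prints true
```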

Ishmael
These unfortunately don't work for me. Is this a standard way of escaping URLs? Firefox was unable to load a URL of the form `file:///.../%u3400.jpg` for a file named `㐀.jpg` on the given path.
Bear
Sorry, I guess that just works for JavaScript escape/unescape. I tried your encoding, and it works on my localhost. As is mentioned elsewhere, you may need to tell the server in a header that you are sending UTF-8.
Ishmael
If your host page has an encoding meta tag, that should do for convincing the server you are speaking UTF-8. I would think. Maybe.
Ishmael
+1  A: 

From what I've tested, IE doesn't handle encoded file URLs, but it does handle normal encoded http URLs, so that could be the issue. I'm not sure how you are loading them, but you should check that.

Malcolm Lim
This turns out to be the issue. The Flash ActiveX control (IE) only loads unencoded file URLs, whereas the Flash plug-in (Chrome, Firefox, Safari, etc.) will only load encoded file URLs. The only workaround I've been able to think of so far is: if the Flash player is ActiveX, use the unencoded URL; otherwise use the URL-encoded URL. Kinda hacky if you ask me.
Bear
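The workaround described in the comment above could look roughly like this, sketched in JavaScript rather than ActionScript. `chooseAssetUrl` is a hypothetical helper; it assumes the caller passes in the player type string, which in ActionScript would come from `flash.system.Capabilities.playerType` (that property reports "ActiveX" for the IE control).

```javascript
// Hypothetical helper: pick the filename form the current Flash player
// is reported to accept for local file URLs.
function chooseAssetUrl(fileName, playerType) {
  if (playerType === "ActiveX") {
    return fileName;                   // IE control: leave the name unencoded
  }
  return encodeURIComponent(fileName); // other plug-ins: UTF-8 percent-encode
}

console.log(chooseAssetUrl("日本語.jpg", "ActiveX"));
// prints "日本語.jpg"
console.log(chooseAssetUrl("日本語.jpg", "PlugIn"));
// prints "%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg"
```

Note this only reflects the file URL behaviour reported in this thread; http URLs may not need the branch at all.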
+1  A: 

How the file:// protocol is handled depends on your OS region settings. If your system locale is set to English rather than Chinese, you can't get IE to do this.

Weixiao.Fan