My problem is with .Net Http/Uri libraries not being able to decode or unescape this character sequence: "Hi%E1". Neither Uri.UnescapeDataString nor HttpUtility.UrlDecode can do it.

Although I have a solution to get around this problem ( http://stackoverflow.com/questions/1221849/url-decoding-confusion ) I would like to understand why it is failing.

The 1st test here throws an exception! The second just fails.

Assert.That(Uri.UnescapeDataString("Hi%E1"), Is.EqualTo("Hiá"));
HttpUtility.UrlDecode("Hi%E1").ShouldBe("Hiá");

There is nothing in the docs to indicate that UnescapeDataString or UrlDecode are restricted to character sets or any reason why these tests would fail. However, from testing, it would appear that HttpUtility assumes UTF-8 (or some other) encoding.

The Java equivalent works! Probably because it allows an encoding to be set.

URLDecoder.decode("Hi%E1","windows-1252");    // this works btw, ie passes tests

Which looks like a very sensible move considering the .Net work-around (see URL above)
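For comparison, here is a minimal Java sketch of the same decode with the charset made explicit. The class name DecodeCharsetDemo and the helper decode are made up for illustration, and it assumes Java 10 or later for the URLDecoder.decode(String, Charset) overload. It only shows that the result depends on which charset the escaped bytes are read in.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

// Hypothetical demo class; the name and the helper are illustrative only.
class DecodeCharsetDemo {
    // Decode percent-escapes, interpreting the escaped bytes in the named charset.
    static String decode(String escaped, String charsetName) {
        return URLDecoder.decode(escaped, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Byte 0xE1 is U+00E1 (a with acute) in windows-1252.
        System.out.println(decode("Hi%E1", "windows-1252"));
        // Bytes 0xC3 0xA1 are the UTF-8 encoding of the same character.
        System.out.println(decode("Hi%C3%A1", "UTF-8"));
        // A lone 0xE1 is not a complete UTF-8 sequence, so this does not
        // produce U+00E1; the comparison prints false.
        System.out.println(decode("Hi%E1", "UTF-8").equals("Hi\u00e1"));
    }
}
```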

Are the .Net implementations of these methods just crap and .Net devs just have to write their own - or am I missing something?

BTW everything I know of in IIS is set to UTF-8, and Chinese/Japanese characters show fine, so I don't yet know how this URI came to contain windows-1252 encoded characters. If I could fix the URI to contain UTF-8 encoding, that would be a better way of fixing this.

+1  A: 

According to this you can also set the encoding using an overload of HttpUtility.UrlDecode.

Although, that seems too simple if you're running into problems... just making sure you saw the overload.

Jim Leonardo
I figured it'd be more complicated than that. Just wanted to make sure you weren't missing the obvious. We all do once in a while.
Jim Leonardo
Oops, no I didn't see the overload
PandaWood
+1  A: 

Seems to work as specified...

HttpUtility.UrlDecode("Hi%E1", System.Text.Encoding.GetEncoding("windows-1252"));

Edit: Answer to comment.

If you use Reflector on HttpUtility.UrlDecode(string) you see that it uses UTF8 as the default Encoding. (As it should.)

//From Reflector (System.Web)
public static string UrlDecode(string str)
{
    if (str == null)
    {
        return null;
    }
    return UrlDecode(str, Encoding.UTF8);
}
Jesper Palm
Yep, I didn't see the overload. You can also do this: HttpUtility.UrlDecode("Hi%E1", Encoding.GetEncoding(1252)); The overload without an encoding argument should be obsolete, as whatever the default is, nobody knows!
PandaWood
Thank you, yes. And the real question for the underlying issue is, how did I manage to create data that was encoded in the non-default encoding...
PandaWood
+1  A: 

Addendum

I discovered the underlying issue to this problem. I was using 'escape' in JavaScript. It's deprecated, don't use it.

escape('á') returns '%E1', which is the single windows-1252 (and Latin-1) byte for that character. It will fail or return the wrong character with the methods above, e.g. HttpUtility.UrlDecode, unless you specify 'windows-1252' in the overload.

encodeURI('á') returns '%C3%A1', which is the UTF-8 encoding. This works and all your troubles go away: the methods above decode it without throwing exceptions or producing the wrong character.
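The two byte sequences above can be reproduced from the Java side as a rough illustration. This sketch uses URLEncoder, which does form encoding and is not the same function as JavaScript's escape or encodeURI; it only shows which bytes each charset produces for the character. The class name EncodeCharsetDemo is made up, and it assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not a port of JavaScript's escape/encodeURI.
class EncodeCharsetDemo {
    // Percent-encode text using the bytes that the given charset produces.
    static String encode(String text, Charset charset) {
        return URLEncoder.encode(text, charset);
    }

    public static void main(String[] args) {
        String aAcute = "\u00e1"; // U+00E1, a with acute

        // One byte in windows-1252: prints %E1
        System.out.println(encode(aAcute, Charset.forName("windows-1252")));

        // Two bytes in UTF-8: prints %C3%A1
        System.out.println(encode(aAcute, StandardCharsets.UTF_8));
    }
}
```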

Dreaming: Wouldn't it be nice if the Uri.UnescapeDataString specified which escape character was the problem? My URI at the time of diagnosis was 23,000 characters long. "Invalid URI" is not such a helpful message in that scenario.
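In the meantime, one way to find the offending escape in a very long URI is to scan it yourself. The following is a rough Java sketch, not anything the .Net libraries provide: the class EscapeLocator and the method firstBadEscape are hypothetical names, and it assumes the URI is meant to be UTF-8. It reports where the first run of percent-escapes starts that is malformed or does not decode as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosing long URIs; assumes UTF-8 is intended.
class EscapeLocator {
    // Returns the index of the '%' that starts the first problematic run of
    // escapes (truncated, non-hex, or not valid UTF-8), or -1 if none is found.
    // For an invalid-UTF-8 run it reports the start of the run, not the exact byte.
    static int firstBadEscape(String s) {
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) != '%') {
                i++;
                continue;
            }
            int runStart = i;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Collect consecutive %XX escapes into one byte sequence.
            while (i < s.length() && s.charAt(i) == '%') {
                if (i + 2 >= s.length()) {
                    return i; // escape is cut off at the end of the string
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    return i; // not two hex digits
                }
                bytes.write(hi * 16 + lo);
                i += 3;
            }
            // A strict decoder throws instead of substituting U+FFFD.
            CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                strict.decode(ByteBuffer.wrap(bytes.toByteArray()));
            } catch (CharacterCodingException e) {
                return runStart;
            }
        }
        return -1;
    }
}
```

With this sketch, firstBadEscape("Hi%E1") returns 2 and firstBadEscape("Hi%C3%A1") returns -1, so on a 23,000 character URI the returned index points at the neighbourhood to inspect.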

PandaWood
I'm more curious about how you get a 23k char URI?
Davy8