views:

763

answers:

2

I have been tracking down a bug on a Url Rewriting application. The bug showed up as an encoding problem on some diacritic characters in the querystring.

Basically, the problem was that a request which was basically /search.aspx?search=heřmánek was getting rewritten with a querystring of "search=he%c5%99m%c3%a1nek"

The correct value (using some different, working code) was a rewrite of the querystring as "search=he%u0159m%u00e1nek"

Note the difference between the two strings. However, if you post both you'll see that the Url Encoding reproduces the same string. It's not until you use the context.Rewrite function that the encoding breaks. The broken string returns 'heÅmánek' (using Request.QueryString["Search"] and the working string returns 'heřmánek'. This change happens after the call to the rewrite function.

I traced this down to one set of code using Request.QueryString (working) and the other using Request.Url.Query (request.Url returns a Uri instance).

While I have worked out the bug there is a hole in my understanding here, so if anyone knows the difference, I'm ready for the lesson.

+2  A: 

Your question really sparked my interest, so I've done some reading for the past hour or so. I'm not absolutely positive I've found the answer, but I'll throw it out there to see what you think.

From what I've read so far, Request.QueryString is actually "a parsed version of the QUERY_STRING variable in the ServerVariables collection" [reference] , where as Request.Url is (as you stated) the raw URL encapsulated in the Uri object. According to this article, the Uri class' constructor "...parses the [url string], puts it in canonical format, and makes any required escape encodings."

Therefore, it appears that Request.QueryString uses a different function to parse the "QUERY_STRING" variable from the ServerVariables constructor. This would explain why you see the difference between the two. Now, why different encoding methods are used by the custom parsing function and the Uri object's parsing function is entirely beyond me. Maybe somebody a bit more versed on the aspnet_isapi DLL could provide some answers with that question.

Anyway, hopefully my post makes sense. On a side note, I'd like to add another reference which also provided for some very thorough and interesting reading: http://download.microsoft.com/download/6/c/a/6ca715c5-2095-4eec-a56f-a5ee904a1387/Ch-12_HTTP_Request_Context.pdf

regex
Both properties return the same encoded string most of the time - the constructors and parsing are irrelevant in this case. It's only after the rewrite call that the Uri's encoding changes.
zombat
+1  A: 

What you indicated as the "broken" encoded string is actually the correct encoding according to standards. The one that you indicated as "correct" encoding is using a non-standard extension to the specifications to allow a format of %uXXXX (I believe it's supposed to indicate UTF-16 encoding).

In any case, the "broken" encoded string is ok. You can use the following code to test that:

Uri uri = new Uri("http://www.example.com/test.aspx?search=heřmánek");
Console.WriteLine(uri.Query);
Console.WriteLine(HttpUtility.UrlDecode(uri.Query));

Works fine. However... on a hunch, I tried UrlDecode with a Latin-1 codepage specified, instead of the default UTF-8:

Console.WriteLine(HttpUtility.UrlDecode(uri.Query, 
           Encoding.GetEncoding("iso-8859-1")));

... and I got the bad value you specified, 'heÅmánek'. In other words, it looks like the call to HttpContext.RewritePath() somehow changes the urlencoding/decoding to use the Latin-1 codepage, rather than UTF-8, which is the default encoding used by the UrlEncode/Decode methods.

This looks like a bug if you ask me. You can look at the RewritePath() code in reflector and see that it is definitely playing with the querystring - passing it around to all kinds of virtual path functions, and out to some unmanaged IIS code.

I wonder if somewhere along the way, the Uri at the core of the Request object gets manipulated with the wrong codepage? That would explain why Request.Querystring (which is simply the raw values from the HTTP headers) would be correct, while the Uri using the wrong encoding for the diacriticals would be incorrect.

womp