When parsing HTML for certain web pages (most notably, any Windows Live page), I encounter a lot of URLs in the following format:

http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm

These appear to be partially UTF-8 escaped strings (\x2f = /, \x3a = :, etc.). Is there a .NET API that can be used to transform these strings into a System.Uri? It seems easy enough to parse, but I'm trying to avoid reinventing the wheel today.

A: 

Did you try HttpUtility.UrlDecode?

leppie
I had not tried that, but it doesn't work.
JaredPar
+2  A: 

What you posted is not URL-encoded text (URL encoding would use %3a, not \x3a), so of course HttpUtility.UrlDecode() won't work. Regardless, you can turn it back into normal text like this:

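// Requires: using System.Globalization; using System.Text.RegularExpressions;
string input = @"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm";
// Replace each \xNN escape (upper- or lowercase hex) with the character whose code is 0xNN.
string output = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
    m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());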

Note that this assumes the encoding is Latin-1 rather than UTF-8. The input you provided is inconclusive in that respect, since every escape in it is a plain ASCII character. If you need UTF-8 to work, you need a slightly longer route: convert the string to bytes, replacing each escape sequence with the corresponding byte along the way (this probably needs a while loop), and then call Encoding.UTF8.GetString() on the resulting byte array.
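For what it's worth, the UTF-8 route described above might look roughly like the sketch below. The class and method names (EscapeDecoder, DecodeHexEscapes) are made up for illustration and are not part of .NET. It also assumes that \xNN is the only escape form that needs handling and that every escaped value is one byte of a UTF-8 sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not a .NET API.
public static class EscapeDecoder
{
    // Treats each \xNN escape as one raw byte, encodes all other text as
    // UTF-8, and finally decodes the whole byte sequence as UTF-8.
    public static string DecodeHexEscapes(string input)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // An escape is four characters: '\', 'x', and two hex digits.
            if (i + 3 < input.Length
                && input[i] == '\\' && input[i + 1] == 'x'
                && Uri.IsHexDigit(input[i + 2]) && Uri.IsHexDigit(input[i + 3]))
            {
                bytes.Add(byte.Parse(input.Substring(i + 2, 2),
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                i += 4;
            }
            else
            {
                // Literal text: keep a surrogate pair together so that it
                // is encoded as a single UTF-8 sequence.
                int len = char.IsHighSurrogate(input[i])
                          && i + 1 < input.Length
                          && char.IsLowSurrogate(input[i + 1]) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

With this in place, new Uri(EscapeDecoder.DecodeHexEscapes(input)) should give the System.Uri the question asks for. A backslash that is not followed by x and two hex digits is passed through unchanged, which may or may not be what you want for other escape forms such as \u0041.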

Timwi