I am writing a web application that requires friendly URLs, but I'm not sure how to deal with non-7-bit-ASCII characters. I don't want to replace accented characters with URL-encoded entities either. Is there a C# method that handles this sort of conversion, or do I need to map out every single case I want to handle?

+2  A: 

I don't know how to do it in C#, but the magic words you want are "Unicode decomposition". There's a standard way to break down composed characters like "é", and then you should be able to just filter out the non-ASCII ones.

Edit: this might be what you're looking for.
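
For illustration, a minimal C# sketch of that approach (the helper name is mine, not from the linked answer): normalize to a decomposed form, then keep only the ASCII characters.

using System.Text;

static string ToAsciiViaDecomposition(string input)
{
    // Decompose: "é" becomes "e" followed by a combining acute accent
    string decomposed = input.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder();
    foreach (char c in decomposed)
    {
        // Keep only 7-bit ASCII; the combining marks (and anything else
        // that didn't decompose to ASCII) are dropped
        if (c < 128)
            sb.Append(c);
    }
    return sb.ToString();
}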

Ken
A: 

Well, there's an easy way, I think: there aren't that many of these characters, so you can replace them in the string quite easily using the Replace() method of the string class.
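
For example (a hypothetical, hand-maintained mapping that only covers a handful of characters; every character you care about has to be listed explicitly):

static string ReplaceAccents(string input)
{
    return input
        .Replace("é", "e").Replace("è", "e").Replace("ê", "e")
        .Replace("à", "a").Replace("ç", "c").Replace("ô", "o");
}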

Pooria
+1  A: 

There is something similar on: http://stackoverflow.com/questions/266719/url-routing-handling-spaces-and-illegal-characters-when-creating-friendly-urls

Nevertheless, I don't recommend automatic conversion. Some words can change meaning when you make this kind of change; you can turn a perfectly nice word into an inappropriate one.

eglasius
Thanks for the link. I couldn't find anything in my searches.
Scott Muc
+1  A: 

This link might help: http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

// Requires: using System.Text; (for NormalizationForm)
private string LatinToAscii(string InString)
{
    string newString = string.Empty, charString;
    char ch;
    int charsCopied;

    for (int i = 0; i < InString.Length; i++)
    {
        charString = InString.Substring(i, 1);
        charString = charString.Normalize(NormalizationForm.FormKD);

        // If the character doesn't decompose, leave it as-is
        if (charString.Length == 1)
            newString += charString;
        else
        {
            charsCopied = 0;
            for (int j = 0; j < charString.Length; j++)
            {
                ch = charString[j];

                // If the char is 7-bit ASCII, add it
                if (ch < 128)
                {
                    newString += ch;
                    charsCopied++;
                }
            }

            /* If the decomposition yielded no ASCII at all, give the
             * original character back in its entirety, since we only
             * mean to decompose Latin chars.
             */
            if (charsCopied == 0)
                newString += InString.Substring(i, 1);
        }
    }
    return newString;
}
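
For example, LatinToAscii("Café déjà vu") returns "Cafe deja vu": each accented Latin letter decomposes to an ASCII letter plus a combining mark, and only the ASCII part is kept.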
Patrick McDonald
A: 

http://Montréal.com

(Copy and paste it into your browser; does it work?)

Ape-inago
Unicode characters in the domain name work differently from those in the path/query parts; they're encoded using the “punycode” rules of IDN.
bobince
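
A small sketch of that host-name conversion (illustration only, assuming a framework version that includes System.Globalization.IdnMapping):

using System;
using System.Globalization;

class IdnExample
{
    static void Main()
    {
        var idn = new IdnMapping();
        // Converts the Unicode host name to its ASCII-compatible
        // ("xn--...") punycode form, as a browser would before the DNS lookup
        Console.WriteLine(idn.GetAscii("montréal.com"));
    }
}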
+2  A: 

Use UTF-8:

Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters. — RFC 3986
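
In C#, one way to get that behavior (a small illustration, not part of the quoted answer) is Uri.EscapeDataString, which percent-encodes the UTF-8 octets:

using System;

class PercentEncodingExample
{
    static void Main()
    {
        // "é" becomes its UTF-8 octets 0xC3 0xA9, each percent-encoded
        Console.WriteLine(Uri.EscapeDataString("Montréal"));   // Montr%C3%A9al
    }
}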

Gumbo
+1. It's perfectly allowable to have non-ASCII characters in path parts; you hex-encode their UTF-8 bytes and the browser displays the Unicode version in the address bar. See Wikipedia for somewhere this works well.
bobince
Even though his second sentence was "I don't want to replace accented characters with URL encoded entities", you tell him to do something that "must be percent-encoded to be represented as URI"? What we have here is a failure to communicate.
Ken
I think he assumes that such encoded words are displayed as `%xx` and not as the characters they represent. But that is only the case if the words are not UTF-8 encoded.
Gumbo
You don't *have* to hex-encode; you can use an ‘IRI’ (URI with plain unescaped Unicode characters) and it'll work as the same URI in browsers; just escaping is more historically reliable. It's arguable what “URL encoded entities” is supposed to mean; visible %-escaping? HTML entity references?
bobince
+1  A: 

Ok -- there are some good answers here. Those methods would work. However, I have to question your basic premise. I presume that these values that you are discussing are basically to be querystring parameters, yes? That's the most common reason to have to filter out special characters.

For two or three years, I used a string encoding/decoding approach to pass stuff like this through querystring. There were always intermittent problems, because -- darn it -- there are just so many different possible special characters, and issues in one browser vs another, etc. Our methods weren't as sophisticated as those outlined here, but still. In 2005, during a rewrite of much of the system I was working on, we decided to move to only ever passing id values through querystring. That approach has worked extremely well, and I can't think of any drawbacks to it. If you have a database back-end, you already have an id attached to pretty much every string, anyway. If this is for searches or the like, you can always send it via form post -- or you can use an AJAX solution that doesn't require you to load another page in the first place.

Those methods aren't going to be the best for every situation -- there is no magic bullet here any more than anywhere else -- but this approach has been simple and very functional for me and my team, and so I think it's something for you to at least consider.

x4000
They won't be querystring variables. I will be making URLs of the form http://server/name/of-montreal, and I want that URL slug "of-montreal" to be automatically generated from the value "Of Montréal". In the cases where things get translated poorly there will always be a manual override.
Scott Muc
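
For reference, a minimal sketch of that kind of slug generation (hypothetical helper, building on the decomposition approaches above; "Of Montréal" becomes "of-montreal"):

using System.Text.RegularExpressions;

static string ToSlug(string title)
{
    // Strip accents first (e.g. with LatinToAscii or the decomposition
    // sketch above), then lower-case
    string ascii = ToAsciiViaDecomposition(title).ToLowerInvariant();
    // Collapse everything that isn't a-z or 0-9 into single hyphens
    return Regex.Replace(ascii, "[^a-z0-9]+", "-").Trim('-');
}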
Then you're definitely on track with the suggestions from the others. It sounds like you will be able to just generate these once and then store them in a database, which is even better -- having to encode/decode in realtime is less efficient.
x4000