I am writing a web application that requires friendly URLs, but I'm not sure how to deal with non-7-bit-ASCII characters. I don't want to replace accented characters with URL-encoded entities either. Is there a C# method that handles this sort of conversion, or do I need to map out every single case I want to handle?

+2  A: 

I don't know how to do it in C#, but the magic words you want are "Unicode decomposition". There's a standard way to break down composed characters like "é", and then you should be able to just filter out the non-ASCII ones.

Edit: this might be what you're looking for.
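
For illustration, a minimal C# sketch of that approach (the helper name is mine, not from the linked answer): normalize to a decomposed form, then keep only the ASCII characters.

using System.Text;

static string ToAsciiViaDecomposition(string input)
{
    // Decompose: "é" becomes "e" followed by a combining acute accent
    string decomposed = input.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder();
    foreach (char c in decomposed)
    {
        // Keep only 7-bit ASCII; the combining marks (and anything else
        // that didn't decompose to ASCII) are dropped
        if (c < 128)
            sb.Append(c);
    }
    return sb.ToString();
}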

Ken
A: 

Well, there's an easy way, I think: there aren't that many of these characters, so you can replace them in the string quite easily using the Replace() method of the string class.
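
For example (a hypothetical, hand-maintained mapping that only covers a handful of characters; every character you care about has to be listed explicitly):

static string ReplaceAccents(string input)
{
    return input
        .Replace("é", "e").Replace("è", "e").Replace("ê", "e")
        .Replace("à", "a").Replace("ç", "c").Replace("ô", "o");
}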

Pooria
+1  A: 

There is something similar on: http://stackoverflow.com/questions/266719/url-routing-handling-spaces-and-illegal-characters-when-creating-friendly-urls

Nevertheless, I don't recommend automatic conversion. Some words can change meaning when you make this kind of change; you can turn a perfectly nice word into an inappropriate one.

eglasius
Thanks for the link. I couldn't find anything in my searches.
Scott Muc
+1  A: 

This link might help: http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

// Requires: using System.Text; (for NormalizationForm)
private string LatinToAscii(string InString)
{
    string newString = string.Empty, charString;
    char ch;
    int charsCopied;

    for (int i = 0; i < InString.Length; i++)
    {
        charString = InString.Substring(i, 1);
        charString = charString.Normalize(NormalizationForm.FormKD);

        // If the character doesn't decompose, leave it as-is
        if (charString.Length == 1)
            newString += charString;
        else
        {
            charsCopied = 0;
            for (int j = 0; j < charString.Length; j++)
            {
                ch = charString[j];

                // If the char is 7-bit ASCII, add it
                if (ch < 128)
                {
                    newString += ch;
                    charsCopied++;
                }
            }

            /* If the decomposition yielded no ASCII at all, give the
             * original character back in its entirety, since we only
             * mean to decompose Latin chars.
             */
            if (charsCopied == 0)
                newString += InString.Substring(i, 1);
        }
    }
    return newString;
}
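
For example, LatinToAscii("Café déjà vu") returns "Cafe deja vu": each accented Latin letter decomposes to an ASCII letter plus a combining mark, and only the ASCII part is kept.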
Patrick McDonald
A: 

http://Montréal.com

(Copy and paste it into your browser; does it work?)

Ape-inago
Unicode characters in the domain name work differently from those in the path/query parts; they're encoded using the “punycode” rules of IDN.
bobince
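
A small sketch of that host-name conversion (illustration only, assuming a framework version that includes System.Globalization.IdnMapping):

using System;
using System.Globalization;

class IdnExample
{
    static void Main()
    {
        var idn = new IdnMapping();
        // Converts the Unicode host name to its ASCII-compatible
        // ("xn--...") punycode form, as a browser would before the DNS lookup
        Console.WriteLine(idn.GetAscii("montréal.com"));
    }
}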
+2  A: 

Use UTF-8:

Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters. — RFC 3986
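
In C#, one way to get that behavior (a small illustration, not part of the quoted answer) is Uri.EscapeDataString, which percent-encodes the UTF-8 octets:

using System;

class PercentEncodingExample
{
    static void Main()
    {
        // "é" becomes its UTF-8 octets 0xC3 0xA9, each percent-encoded
        Console.WriteLine(Uri.EscapeDataString("Montréal"));   // Montr%C3%A9al
    }
}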

Gumbo
+1. It's perfectly allowable to have non-ASCII characters in path parts; you hex-encode their UTF-8 bytes and the browser displays the Unicode version in the address bar. See Wikipedia for somewhere this works well.
bobince
Even though his second sentence was "I don't want to replace accented characters with URL encoded entities", you tell him to do something that "must be percent-encoded to be represented as URI"? What we have here is a failure to communicate.
Ken
I think he assumes that such encoded words are displayed as `%xx` and not as the characters they represent. But that is only the case if the words are not UTF-8 encoded.
Gumbo
You don't *have* to hex-encode; you can use an ‘IRI’ (URI with plain unescaped Unicode characters) and it'll work as the same URI in browsers; just escaping is more historically reliable. It's arguable what “URL encoded entities” is supposed to mean; visible %-escaping? HTML entity references?
bobince
+1  A: 

Ok -- there are some good answers here. Those methods would work. However, I have to question your basic premise. I presume that these values that you are discussing are basically to be querystring parameters, yes? That's the most common reason to have to filter out special characters.

For two or three years, I used a string encoding/decoding approach to pass stuff like this through querystring. There were always intermittent problems, because -- darn it -- there are just so many different possible special characters, and issues in one browser vs another, etc. Our methods weren't as sophisticated as those outlined here, but still. In 2005, during a rewrite of much of the system I was working on, we decided to move to only ever passing id values through querystring. That approach has worked extremely well, and I can't think of any drawbacks to it. If you have a database back-end, you already have an id attached to pretty much every string, anyway. If this is for searches or the like, you can always send it via form post -- or you can use an AJAX solution that doesn't require you to load another page in the first place.

Those methods aren't going to be the best for every situation -- there is no magic bullet here any more than anywhere else -- but this approach has been simple and very functional for me and my team, and so I think it's something for you to at least consider.

x4000
They won't be querystring variables. I will be making URLs of the form http://server/name/of-montreal, and I want that URL slug "of-montreal" to be automatically generated from the value "Of Montréal". In the cases where things get translated poorly there will always be a manual override.
Scott Muc
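
For reference, a minimal sketch of that kind of slug generation (hypothetical helper, building on the decomposition approaches above; "Of Montréal" becomes "of-montreal"):

using System.Text.RegularExpressions;

static string ToSlug(string title)
{
    // Strip accents first (e.g. with LatinToAscii or the decomposition
    // sketch above), then lower-case
    string ascii = ToAsciiViaDecomposition(title).ToLowerInvariant();
    // Collapse everything that isn't a-z or 0-9 into single hyphens
    return Regex.Replace(ascii, "[^a-z0-9]+", "-").Trim('-');
}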
Then you're definitely on track with the suggestions from the others. It sounds like you will be able to just generate these once and then store them in a database, which is even better -- having to encode/decode in realtime is less efficient.
x4000