A: 

As a follow-up, I do have some ideas. So feel free to comment on the ideas or give your own answer to the question:

Solution #1: Replace all illegal characters with dashes:

  • www.mysite.com/diseases---conditions/Auto-immune-disorders/the--1-killer-of-people-is-some-disease/

That looks a little ugly to me...

Solution #2: Strip illegal characters and replace spaces with single dashes:

  • www.mysite.com/diseases-conditions/Auto-immune-disorders/the-1-killer-of-people-is-some-disease/

Solution #3: Apply a few rules to replace certain characters with words:

  • www.mysite.com/diseases-and-conditions/Auto-immune-disorders/the-number1-killer-of-people-is-some-disease/

Solution #4: Strip all spaces and use capitalization:

  • www.mysite.com/DiseasesAndConditions/AutoImmuneDisorders/TheNumber1KillerOfPeopleIsSomeDisease/

(May not work well on case-sensitive servers and is hard to read)

Atømix
A: 

Solution 2 would be my recommendation. I'm not the world's biggest SEO expert, but I believe it's pretty much the 'standard' way to get good rankings anyway.

da5id
A: 

What I normally do is allow only legal characters and keep the friendly URL as short as possible. Also important: friendly URLs are often entered by a human. I never generate a friendly URL from the title or content and then use it to query the database. Instead I would use a column in a table, e.g. friendly_url, so that the website admin can enter friendly URLs.

Arief Iman Santoso
+3  A: 

My last approach is:

  1. Convert all "strange letters" to "normal letters" -> à to a, ñ to n, etc.
  2. Convert all non-word characters to _ (i.e. not a-zA-Z0-9)
  3. Replace groups of underscores with a single underscore
  4. Remove all trailing and leading underscores
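For illustration, here is a minimal Python sketch of those four steps (the function name is my own; the NFKD decomposition trick is one common way to do step 1):

```python
import re
import unicodedata

def slugify(title: str) -> str:
    # 1. Convert "strange letters" to "normal letters" (à -> a, ñ -> n, ...)
    #    NFKD splits accented characters into base letter + combining mark,
    #    and the ASCII encode drops the marks.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # 2 & 3. Convert each run of non-word characters to a single underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_title)
    # 4. Remove leading and trailing underscores
    return slug.strip("_")
```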

As for storage, I believe the friendly URL should go in the database and be immutable; after all, cool URIs don't change.

alex
A: 

I solved this problem by adding an additional column in the database (e.g. UrlTitle alongside the Title column) and saving a title stripped of all illegal characters, with '&' symbols replaced by 'and' and spaces replaced by underscores. Then you can look up via the UrlTitle and use the real one in the page title or wherever.
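A Python sketch of how that UrlTitle value might be derived from the Title (function name and exact rules are my own reading of the description above):

```python
import re

def to_url_title(title: str) -> str:
    """Derive a UrlTitle column value from the Title column."""
    t = title.replace("&", "and")          # '&' symbols become 'and'
    t = re.sub(r"[^A-Za-z0-9 ]", "", t)    # strip all illegal characters
    return t.replace(" ", "_")             # spaces become underscores
```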

Nick
+1  A: 

Solution 2 is the typical approach for these; some refinements are possible, e.g. turning apostrophes into nothing instead of a dash, for readability. Typically you will want to store the munged-for-URL-validity version of the title in the database as well as the ‘real’ title, so you can select the item using an indexed SELECT WHERE.

However, there is actually no illegal character in a URL path part, as long as you encode it appropriately. For example, a space, hash or slash can be encoded as %20, %23 or %2F. This way it is possible to encode any string into a URL part, so you can SELECT it back out of the database by the actual, unchanged title.

There are a few potential problems with this depending on your web framework though. For example anything based on CGI will be unable to tell the difference between an encoded %2F and a real /, and some frameworks/deployments can have difficulty with Unicode characters.
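As a sketch of the encode-everything approach in Python (the example title is made up), note that the encoding is lossless, so the exact original title can be recovered for the database lookup:

```python
from urllib.parse import quote, unquote

title = "AIDS/HIV & You: 100% of the facts"  # made-up title with "illegal" characters

# safe="" makes quote() percent-encode "/" as well,
# so the whole title fits inside a single path part
encoded = quote(title, safe="")
# -> "AIDS%2FHIV%20%26%20You%3A%20100%25%20of%20the%20facts"

# Round-trips exactly, so the unchanged title can be used in a WHERE clause
original = unquote(encoded)
```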

Alternatively, a simple and safe solution is to include the primary key in the URL, using the title parts purely to make the address nicer, e.g.:

http://www.example.com/x/category-name/subcat-name/article-name/348254863

This is how e.g. Amazon does it. It also has the advantage that you can change the title in the database and have the URL with the old title redirect automatically to the new one.
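A minimal Python sketch of that lookup-and-redirect idea (the article table and helper name here are hypothetical): only the trailing number is used for the lookup, and the canonical URL is rebuilt from the stored title.

```python
import re

# Hypothetical articles table: primary key -> current slugged title
ARTICLES = {348254863: "article-name"}

def canonical_path(path: str):
    """Select by the trailing primary key; stale-title URLs still resolve."""
    m = re.search(r"^/x/(.+)/(\d+)$", path.rstrip("/"))
    if not m:
        return None
    pk = int(m.group(2))
    current = ARTICLES.get(pk)
    if current is None:
        return None
    # Rebuild the canonical URL; a request with an old title in the path
    # would be redirected here, since only the number mattered
    return f"/x/{current}/{pk}"
```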

bobince
Good points; you've got to strike a balance between encoding illegal characters versus removing them for user-friendliness. It's not only Amazon adding the PK to the URL: Stack Overflow does it too :)
Nick
I really like the idea of using the primary key. That's what I was passing with my query strings before anyhow.
Atømix
A: 

I suggest doing what WordPress does: strip out small words and replace illegal characters with dashes (max 1 dash), then let the user correct the URL if they want to. It's better for SEO to make the URL configurable.

+2  A: 

If you're going to strip spaces, I myself prefer _ to - for readability reasons (put an underline on the link and the _'s virtually go_away).

You may want to try casting extended characters, e.g. ü, to close-ASCII equivalents where possible:

ü -> u

However, in my experience the biggest problem with actual SEO-related issues is not that the URL contains all the lovely text, it's that when people change the text in the link, all your SEO work turns to crap because you now have DEADLINKS in the indexes.

For this, I would suggest doing what Stack Overflow does: have a numeric part which references a constant entity, and totally ignore the rest of the text (and/or update it when it's wrong).

Also, the grossly hierarchical nature just makes for bad usability by humans. Humans hate long URLs: copy-pasting them sucks and they're just more prone to breaking. If you can, subdivide it into lower tiers, e.g.:

/article/1/Some_Article_Title_Here
/article/1/Section/5/Section_Title_Here
/section/19023/Section_Title_here  ( == above link )

That way the only time you need to do voodoo magic is when the numbered article has actually been deleted, at which point you use the text part as a search string to try to find the real article or something like it.
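A Python sketch of that resolution strategy (function names, the id-to-article mapping and the search callback are all hypothetical): the slug is ignored for the normal case and only consulted as a fallback search string.

```python
import re

def resolve_article(path, articles_by_id, search):
    """Resolve /article/<id>/<slug>, ignoring the slug; fall back to search if deleted."""
    m = re.match(r"^/article/(\d+)/([^/]*)", path)
    if not m:
        return None
    article_id, slug = int(m.group(1)), m.group(2)
    article = articles_by_id.get(article_id)
    if article is not None:
        return article          # the numbered entity still exists: text part ignored
    # Article gone: use the text part as a search string to find something like it
    return search(slug.replace("_", " "))
```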

Kent Fredric
Good idea, but an underscore looks like a space in an underlined link, so you can have problems there. The other suggestion looks good though.
Atømix
This is the method that looks like it has the most flexibility. I've already tested it and it seems to work well. The title is ignored and only the IDs are used.
Atømix
+1  A: 

In case anyone is interested. This is the route (oooh... punny) I'm taking:

Route r = new Route("{country}/{lang}/Article/{id}/{title}/", new NFRouteHandler("OneArticle"));
Route r2 = new Route("{country}/{lang}/Section/{id}-{subid}/{title}/", new NFRouteHandler("ArticlesInSubcategory"));
Route r3 = new Route("{country}/{lang}/Section/{id}/{title}/", new NFRouteHandler("ArticlesByCategory"));

This offers me the ability to do urls like so:

  • site.com/ca/en/Article/123/my-life-and-health
  • site.com/ca/en/Section/12-3/Health-Issues
  • site.com/ca/en/Section/12/
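The real matching is done by the ASP.NET routing above, but the first route pattern can be approximated with a regex to show which parts of the example URLs bind to which placeholders (the group names mirror the route tokens; this is just an illustration):

```python
import re

# Rough regex equivalent of the "{country}/{lang}/Article/{id}/{title}/" route
ARTICLE_ROUTE = re.compile(
    r"^/(?P<country>[^/]+)/(?P<lang>[^/]+)/Article/(?P<id>\d+)/(?P<title>[^/]+)/?$"
)

m = ARTICLE_ROUTE.match("/ca/en/Article/123/my-life-and-health")
# m.group("country") -> "ca", m.group("id") -> "123"
```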
Atømix
+1  A: 

When cleaning URLs, here's a method I'm using to replace accented characters:

private static string anglicized(this string urlpart) {
        string before = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
        string  after = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";

        string cleaned = urlpart;

        // Replace each accented character with its plain equivalent
        for (int i = 0; i < before.Length; i++) {
            cleaned = Regex.Replace(cleaned, before[i].ToString(), after[i].ToString());
        }

        return cleaned;

        // Here's some for Spanish: ÁÉÍÑÓÚÜ¡¿áéíñóúü
}

Don't know if it's the most efficient regex, but it is certainly effective. It's an extension method, so to call it you simply put the method in a static class and do something like this:

string articleTitle = "My Article about café and the letters àâäá";
string cleaned = articleTitle.anglicized();

// strip all illegal characters like punctuation
cleaned = Regex.Replace( cleaned, "[^A-Za-z0-9- ]", "");

// replace runs of spaces with single dashes and lowercase
cleaned = Regex.Replace( cleaned, " +", "-").ToLower();

// returns "my-article-about-cafe-and-the-letters-aaaa"

Of course, you could combine it into one method called "CleanUrl" or something but that's up to you.

Atømix
For a more complete, fully Unicode compatible version - http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
devstuff
A: 

As a client user, not a Web designer, I find Firefox sometimes breaks the URL when it tries to replace "illegal" characters with usable ones. For example, FF replaces ~ with %7E. That never loads for me. I can't understand why the HTML editors and browsers don't simply agree not to accept characters other than A-Z and 0-9. If certain scripts need %, ?, and such, change the scripting applications so they will work with alphanumerics.

Well, unfortunately, computer programs need to be as generic as possible to be the most useful... or to be "programmable", and that means that programs need to accept any input you throw at them.
Atømix