I'm using ASP.NET/C# and I'm looking to create unique(?) URIs for a small CMS I am building.

I am generating the URI segment from the article's title, so if the title is "My amazing article", the URI would be www.website.com/news/my-amazing-article

There are two parts to this. First, which characters do you think I need to strip out? I am replacing spaces with "-" and I think I should strip out the "/" character too. Can you think of any more that might cause problems? "?" perhaps? Should I remove all non-alphanumeric characters?

Second, as mentioned above, the URIs MAY need to be unique. I was going to check the existing URIs before adding a new one to ensure uniqueness; however, I see Stack Overflow uses a number plus a URI segment. I assume this allows titles to be duplicated? Do you think this would be a better way?

+3  A: 

Transform all diacritics into their base characters, then replace anything that is not a letter or a digit (as tested by Char.IsLetterOrDigit) with a space.

Then replace each run of spaces with a single dash.

This is what we use in our software.

// Requires: using System.Globalization; using System.Text; using System.Text.RegularExpressions;
/// <summary>
/// Convert a name into a string that can be appended to a Uri.
/// </summary>
private static string EscapeName(string name)
{
    if (!string.IsNullOrEmpty(name))
    {
        name = NormalizeString(name);

        // Replace every non-alphanumeric character with a space
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < name.Length; i++)
        {
            builder.Append(char.IsLetterOrDigit(name[i]) ? name[i] : ' ');
        }

        // Trim so punctuation at either end does not leave a leading or trailing dash
        name = builder.ToString().Trim();

        // Collapse each run of spaces into a single dash
        name = Regex.Replace(name, @" +", "-");
    }

    return name;
}

/// <summary>
/// Strips diacritics from a string by replacing accented characters with their base characters.
/// </summary>
/// <param name="value">The string to normalize.</param>
/// <returns>The string with all combining marks (accents) removed.</returns>
/// <seealso href="http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net"/>
private static string NormalizeString(string value)
{
    string normalizedFormD = value.Normalize(NormalizationForm.FormD);
    StringBuilder builder = new StringBuilder();

    for (int i = 0; i < normalizedFormD.Length; i++)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(normalizedFormD[i]);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(normalizedFormD[i]);
        }
    }

    return builder.ToString().Normalize(NormalizationForm.FormC);
}
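
To make the behaviour concrete, here is a small self-contained sketch that combines the two steps above. The class name SlugDemo, the method name ToSlug, and the lowercasing step are my own assumptions for illustration; they are not part of the code above.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugDemo
{
    // Hypothetical helper: same idea as EscapeName/NormalizeString,
    // plus lowercasing (an assumption, not in the original answer).
    public static string ToSlug(string title)
    {
        if (string.IsNullOrEmpty(title))
        {
            return string.Empty;
        }

        // Decompose accented characters so the accents become separate marks.
        string decomposed = title.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks left over from decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            // Anything that is not a letter or digit becomes a separator.
            builder.Append(char.IsLetterOrDigit(c) ? c : ' ');
        }

        // Trim the ends, then collapse each run of spaces into one dash.
        string slug = Regex.Replace(builder.ToString().Trim(), @" +", "-");
        return slug.ToLowerInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(ToSlug("My amazing article"));      // my-amazing-article
        Console.WriteLine(ToSlug("Crème brûlée: a how-to?")); // creme-brulee-a-how-to
    }
}
```

Note that char.IsLetterOrDigit is Unicode-aware, so letters that have no decomposition (for example "ø" or non-Latin scripts) pass through unchanged and would still need percent-encoding in the final URI.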

As for using the generated name as a unique ID, I would advise against it. Use the generated name as an SEO helper, not as the key that resolves the page. Look at how Stack Overflow references its pages:

http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
                                   ^--ID  ^--Unneeded name but helpful for bookmarks and SEO

The ID is in there. These two URLs point to the same page:

http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net

http://stackoverflow.com/questions/249087/
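
A rough sketch of the ID-plus-name scheme, assuming a "/news/{id}/{slug}" layout modelled on the question's example. The class and method names are hypothetical, and this is not how Stack Overflow itself implements it.

```csharp
using System;
using System.Globalization;

public static class ArticleUrls
{
    // Builds "/news/{id}/{slug}"; the slug is optional decoration.
    public static string BuildPath(int id, string slug)
    {
        string path = "/news/" + id.ToString(CultureInfo.InvariantCulture) + "/";
        return string.IsNullOrEmpty(slug) ? path : path + slug;
    }

    // Extracts the numeric ID and ignores whatever slug follows it.
    public static bool TryGetId(string path, out int id)
    {
        id = 0;
        if (path == null)
        {
            return false;
        }

        string[] segments = path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        return segments.Length >= 2
            && segments[0] == "news"
            && int.TryParse(segments[1], NumberStyles.None, CultureInfo.InvariantCulture, out id);
    }

    public static void Main()
    {
        int id;
        Console.WriteLine(BuildPath(42, "my-amazing-article"));                        // /news/42/my-amazing-article
        Console.WriteLine(TryGetId("/news/42/my-amazing-article", out id) + " " + id); // True 42
        Console.WriteLine(TryGetId("/news/42/", out id) + " " + id);                   // True 42
    }
}
```

Because only the ID is used for the lookup, a renamed article keeps working under its old links. One option when the requested name does not match the stored one is to redirect to the canonical URL.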
Pierre-Alain Vigeant
Thanks for the interesting and useful code! I've also decided to go for the ID and name combination, as you suggest and as Stack Overflow implements.
DanDan
I don't like this "fake" SEO, where half the URL is meaningless. BTW you can also find this page at http://stackoverflow.com/questions/2095957/and-now-for-something-completely-different
DisgruntledGoat
Very interesting code. Thanks a lot for sharing :)
Felipe Lima
+2  A: 

You should consult IETF RFC 3986, which defines URI syntax and which characters are legal in each part of a URI.

Beyond validity, you probably want a readable URI as well. In that case, eliminate all non-alphanumeric characters.
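
For reference, RFC 3986 (section 2.3) lists the "unreserved" characters as letters, digits, "-", ".", "_" and "~"; these never need percent-encoding. A minimal sketch of a check against that set, with hypothetical class and method names:

```csharp
using System;

public static class UriSegmentCheck
{
    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    public static bool IsUnreserved(char c)
    {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // True when the segment is non-empty and needs no percent-encoding at all.
    public static bool IsSafeSegment(string segment)
    {
        if (string.IsNullOrEmpty(segment))
        {
            return false;
        }

        foreach (char c in segment)
        {
            if (!IsUnreserved(c))
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsSafeSegment("my-amazing-article")); // True
        Console.WriteLine(IsSafeSegment("what/why?"));          // False

        // The alternative to stripping: percent-encode, at the cost of readability.
        Console.WriteLine(Uri.EscapeDataString("what/why?"));   // what%2Fwhy%3F
    }
}
```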

On Stack Overflow the title can be changed, hence the ID, which gives the URI a unique and unchanging identifier. If your titles can't change, you should be OK using just the text. If titles can be edited after publication, an ID may be preferable.

Cheeso
Thanks for the link.
DanDan
+1  A: 

For question 1: Rob Conery has a pretty useful regex-based solution for cleaning strings for slug generation. Here's the extension method (just add this to a static class):

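public static string CreateSlug(this string source)
{
    // Requires: using System.Text.RegularExpressions;
    var slug = "";

    if (!string.IsNullOrEmpty(source))
    {
        // Culture-invariant lowercasing avoids surprises such as the Turkish dotless "i"
        slug = source.Trim().ToLowerInvariant();
        slug = slug.Replace(' ', '-');

        // Strip everything except lowercase letters, digits and dashes
        slug = Regex.Replace(slug, @"[^a-z0-9\-]", "");

        // Collapse runs of dashes after stripping, so "a & b" gives "a-b" rather than "a--b"
        slug = Regex.Replace(slug, @"-{2,}", "-");

        // Reject slugs that lost more than half of the original characters
        if (slug.Length * 2 < source.Length)
            return "";

        if (slug.Length > 100)
            slug = slug.Substring(0, 100);
    }
    return slug;
}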

For question 2, you could simply place a UNIQUE constraint on the column in the database if you want the slugs to be unique. This lets you trap the exception and give the user useful feedback. If you don't like that, relying on the post identifier is probably a good alternative.

Scott Anderson
Or, instead of trapping the exception, query for the URI-ified title, and if you get a result, append -1 to it, then -2, and so on, until you don't find an entry in the DB. You still have to trap exceptions, of course, but ideally you can be smarter about inserting into the DB.
Cheeso
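
The suffixing idea from the comment above could look roughly like this. The slugExists delegate is a hypothetical stand-in for the database lookup, and the in-memory set in Main exists only to make the sketch runnable.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueSlug
{
    // Returns baseSlug if it is free; otherwise tries baseSlug-1, baseSlug-2, ...
    // slugExists is a hypothetical stand-in for a database query.
    public static string MakeUnique(string baseSlug, Func<string, bool> slugExists)
    {
        if (!slugExists(baseSlug))
        {
            return baseSlug;
        }

        for (int suffix = 1; ; suffix++)
        {
            string candidate = baseSlug + "-" + suffix;
            if (!slugExists(candidate))
            {
                return candidate;
            }
        }
    }

    public static void Main()
    {
        var taken = new HashSet<string> { "my-amazing-article", "my-amazing-article-1" };

        Console.WriteLine(MakeUnique("my-amazing-article", taken.Contains)); // my-amazing-article-2
        Console.WriteLine(MakeUnique("another-title", taken.Contains));      // another-title
    }
}
```

Checking and then inserting is not atomic, so two concurrent requests can still pick the same slug. Keeping the UNIQUE constraint as a backstop covers that case.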