views:

449

answers:

2

I'm about to setup permalinks on a website and I'm wondering what sort of conventions I should follow in regards to generating them from the existing table of news articles.

So far here are the rules/steps I've come up with:

  1. Create a dump file of my live database, set it up locally
  2. Add a new 'permalink' column to my news_articles table with ALTER table ADD column slug VARCHAR(100) ( Not sure on the data type or length, suggestions? )
  3. Update every row in my news_articles table, such that

the permalink column by default is replaced by the title, after the title goes through some formatting:

Disclaimer: the titles are Russian Cyrillic and not English so I don't think I can get away with an easy regex replacement, unless perhaps I rely on unicode code ranges except I'm not familiar with the range of the Cyrillic characters so I'll do manual replacement and the rules I've defined so far are:

  • replace all spaces with an underscore
  • strip off leading/trailing commas and periods
  • remove all single/double quotes
  • remove all question marks
  • & becomes the russian word for 'and'

I'd appreciate any advice on my current strategy before I actually execute it, as there are thousands of articles.

+2  A: 

Some comments/suggestions

  • replace all spaces with an underscore (how about hypen instead of underscore, like tags in SO)
  • strip off leading/trailing commas and periods (leading/trailing spaces too)
  • remove all single/double quotes (starting and trailing ones, if not replace with hyphens)
  • remove all question marks (same as above and all other punctuation characters like “ ” (left/right double quotes) and symbols like % and = etc etc)
  • & becomes the russian word for 'and' (Do you mean и? has to be readable without need for url encoding)

Update:
* convert all characters to lower case (kinda subjective but it's a common practice and more readable then all upper case)

One more suggestion, this is a trick which I used to check if my url path or slug do not contain any urlencoded string. After formulated the slug string, I will perform an urlencode and compare with the pre-encoded one. They should be the same.

Example for this post, if the slug happened to be "permalink-slug:best-practices", the urlencoded one will be "permalink-slug%3Abest-practices", not really readable and defeats the purpose of slug which you will know/agree. So it's really easy to know the orginal string don't work because it is different after urlencoding.

o.k.w
Thanks for the recommendations. Do you do the slug generation with server-side? And could you describe how you detect for those types of characters in the urlencoded version? Do you have an array/regex character class of stuff like '%3A' or is there a better method of detecting?
meder
Off my head, what I can think of is to iterate through each character in the formulated slug string (before urlencoding) and check the characters belong to the url accepted character set. See http://www.w3.org/Addressing/URL/uri-spec.html for acceptable characters.
o.k.w
Ok, I'll do this on the server-side then. Thank you.
meder
Hey, you are the same meder on the other question post on fancyzoom! Haha, you are welcomed!
o.k.w
Yep, only one meder I know of around these parts :p
meder
+2  A: 

I think the problem is many programs not allow for unicode url's... Wikipedia uses these characters in the text.. in the browser they look good but when sending the link they become very long like...

http://th.wikipedia.org/wiki/%E0%B8%AA%E0%B9%89%E0%B8%A7%E0%B8%A1%E0%B9%83%E0%B8%99%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2

I think the only good solution is only convert to A-Z alphabet.. and other languages use an ID only..

If anybody has a good solution for this I would love to hear about it :)

Chris