views:

540

answers:

4

After researching how different people slugify titles, I've noticed that most approaches don't address how to deal with non-English titles.

URL encoding is very restrictive. See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

So, for example, how do folks build title slugs for things like

"Una lágrima cayó en la arena"

One can come up with a reasonable table for Indo-European languages, i.e. things that can be encoded via ISO-8859-1. For example, a conversion table would translate 'á' => 'a', so the slug would be

"una-lagrima-cayo-en-la-arena"

However, I'm using Unicode (specifically the UTF-8 encoding), so there are no guarantees about what sort of code points I'm going to get (I have to prepare for things that can't be ISO-8859-1 encoded).

In a nutshell: how do I deal with this? Should I come up with a conversion table for chars in the ISO-8859-1 range (code points below 256) and drop everything else?

EDIT: To give a bit more context: a priori, I don't really expect to slugify data in non-Indo-European languages, but I'd like to have a plan if I encounter such data. A conversion table for extended ASCII would be nice. Any pointers?

Also, since people are asking, I'm using Python, running on Google App Engine.

+1  A: 

If all else fails, you could use a conversion table, but there might be a better-performing solution available. What server-side language are you using?

alex
I'm using Python (the site is running on Google App Engine).
bustrofedon
+2  A: 

In general this is going to depend on the language you expect to get. If your primary userbase is Japanese, dropping everything but ISO-8859-1 characters is unlikely to go over well.

That said, one option might be to use transliteration mode, if your character set conversion library supports it. For example, with GNU iconv, one can do:

] echo Una lágrima cayó en la arena|iconv -f utf8 -t ascii//TRANSLIT
Una lagrima cayo en la arena

As you can see, the accented characters were automatically converted to something in the ASCII range. How to translate this to code will of course depend on the language you're using, but if your language is based on GNU iconv for charset conversion (and if it's on Linux, it probably is), this trick can probably be applied directly by simply specifying "ascii//TRANSLIT" as the convert-to character set.

One thing to note with this, however, is it's only effective with characters that "look like" something in ASCII. For example:

] echo 我輩は猫である。名前はまだない。|iconv -f utf8 -t ascii//TRANSLIT
????????????????

As you can see, it's not much help for Japanese, and needs further processing afterward to remove characters not suitable for URLs.
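For readers who cannot shell out to iconv, a rough Python analogue of the transliteration step is sketched below. It swaps in the standard library's `unicodedata` normalization for iconv's `//TRANSLIT`, so it is not the same mechanism and will not give identical results for every character. The function name `ascii_slug` is made up for illustration, and Python 3 is assumed.

```python
import re
import unicodedata


def ascii_slug(title):
    """Best-effort ASCII slug; characters with no ASCII base letter are dropped."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    # Discard everything that is not ASCII (combining marks, CJK, and so on).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(ascii_slug("Una lágrima cayó en la arena"))        # una-lagrima-cayo-en-la-arena
print(repr(ascii_slug("我輩は猫である。名前はまだない。")))  # ''
```

As with the iconv output above, the Japanese title does not survive: here it collapses to an empty string rather than a row of question marks, so the caller still needs a fallback (for example a numeric ID) for such titles.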

bdonlan
Thanks, I did not know about iconv's ability to "transliterate". However, in the context of web applications, iconv isn't really suitable, let alone on a site running on Google App Engine.
bustrofedon
+2  A: 

I simply use UTF-8 for URL paths. As long as the domain is non-IDN, FF3 and IE work fine with this. Google reads and displays them correctly. The IRI RFC allows Unicode. Just make sure you parse the incoming URLs correctly.
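Since the asker is on Python, here is a minimal sketch of what "UTF-8 in the path" looks like on the wire and how an incoming path could be decoded. It assumes Python 3 and the standard `urllib.parse` module; the `slug` value is only an example, and a real framework may already do the decoding for you.

```python
from urllib.parse import quote, unquote

slug = "una-lágrima-cayó-en-la-arena"

# On the wire the path is sent as percent-encoded UTF-8 bytes.
# quote() encodes str input as UTF-8 by default.
encoded = quote(slug)
print(encoded)  # una-l%C3%A1grima-cay%C3%B3-en-la-arena

# Parsing an incoming path: turn the percent-escapes back into text.
decoded = unquote(encoded)
print(decoded == slug)  # True
```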

felixg
Sure, but how do you write a regex for Unicode characters? [-\w] won't match **lágrima-cayó-en-la-arena**
kRON
I use .NET and it supports Unicode in regular expressions. For JavaScript, check jquery.validate and http://www.ibm.com/developerworks/web/library/wa-uri/index.html
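For the Python case, a small sketch assuming Python 3, where `\w` in a `str` pattern already matches Unicode letters (on Python 2 the `re.UNICODE` flag and unicode strings would be needed instead). The helper name `is_unicode_slug` is hypothetical.

```python
import re
import unicodedata

SLUG_RE = re.compile(r"[-\w]+")


def is_unicode_slug(text):
    # Compose base letter + combining accent into one code point first,
    # because \w does not match a bare combining mark.
    composed = unicodedata.normalize("NFC", text)
    return SLUG_RE.fullmatch(composed) is not None


print(is_unicode_slug("lágrima-cayó-en-la-arena"))  # True
print(is_unicode_slug("lágrima cayó"))              # False: spaces are not allowed

# With re.ASCII, \w shrinks back to [a-zA-Z0-9_] and the match fails.
print(re.fullmatch(r"[-\w]+", "lágrima", re.ASCII))  # None
```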
felixg
+5  A: 

A nearly complete transliteration table (for Latin, Greek and Cyrillic character sets) can be found in the slughifi library. It is geared towards Django, but can be easily modified to fit general needs (I use it with a Werkzeug-based app on App Engine).
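To illustrate the table-driven idea in general terms, here is a minimal sketch. The `TRANSLIT` table below is a tiny hypothetical sample written for this example, not the slughifi table, and `slugify` is not slughifi's API; Python 3 is assumed.

```python
import re
import unicodedata

# Hypothetical sample table: only characters that have no useful
# Unicode decomposition need an explicit entry.
TRANSLIT = {
    "ß": "ss", "æ": "ae", "ø": "o", "ł": "l", "đ": "d",
    "α": "a", "β": "b", "γ": "g",    # Greek sample
    "д": "d", "ж": "zh", "я": "ya",  # Cyrillic sample
}


def slugify(title):
    text = title.lower()
    # 1. Explicit table lookups; unknown characters pass through unchanged.
    text = "".join(TRANSLIT.get(ch, ch) for ch in text)
    # 2. Strip accents via NFKD, then drop whatever is still non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


print(slugify("Straße nach Łódź"))              # strasse-nach-lodz
print(slugify("Una lágrima cayó en la arena"))  # una-lagrima-cayo-en-la-arena
```

Characters missing from the table and without an ASCII base letter are silently dropped in step 2, so the quality of the result depends entirely on how complete the table is.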

zgoda
Thanks zgoda, in my situation (python, app engine), this beefed up slugify will do the job.
bustrofedon