ansaurus

Question

Answer 1

+1 A:

If all else fails, you could use a conversion table, but there might be a better-performing solution available. What server-side language are you using?

alex 2009-05-05 01:01:42

I'm using Python (the site is running on Google App Engine).

bustrofedon 2009-05-05 08:38:50

Answer 2

+2 A:

In general this is going to depend on the language you expect to get. If your primary userbase is Japanese, dropping everything but ISO-8859-1 characters is unlikely to go over well.

That said, one option might be to use transliteration mode, if your character set conversion library supports it. For example, with GNU iconv, one can do:

] echo Una lágrima cayó en la arena|iconv -f utf8 -t ascii//TRANSLIT
Una lagrima cayo en la arena

As you can see, the accented characters were automatically converted to something in the ASCII range. How to translate this to code will of course depend on the language you're using, but if your language is based on GNU iconv for charset conversion (and if it's on Linux, it probably is), this trick can probably be applied directly by simply specifying "ascii//TRANSLIT" as the convert-to character set.

One thing to note with this, however, is that it's only effective with characters that "look like" something in ASCII. For example:

] echo 我輩は猫である。名前はまだない。|iconv -f utf8 -t ascii//TRANSLIT
????????????????

As you can see, it's not much help for Japanese, and needs further processing afterward to remove characters not suitable for URLs.
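
Since the asker is on Python, here is a minimal sketch of a roughly comparable effect without iconv. It swaps in a different technique: Unicode NFKD decomposition followed by dropping every non-ASCII character, then the further processing mentioned above to keep only URL-safe characters. The helper name `ascii_slug` is hypothetical, and the code is written for Python 3, not the Python 2 runtime App Engine offered in 2009.

```python
import re
import unicodedata


def ascii_slug(text):
    """Hypothetical helper: strip accents, then keep only URL-safe ASCII."""
    # NFKD splits an accented letter such as "á" into "a" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the combining marks and any other non-ASCII character.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of characters other than a-z and 0-9 into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(ascii_slug("Una lágrima cayó en la arena"))  # una-lagrima-cayo-en-la-arena
print(repr(ascii_slug("我輩は猫である。名前はまだない。")))  # ''
```

Unlike iconv, which substitutes "?", this approach silently drops characters it cannot map, so the Japanese sentence yields an empty slug and would need a fallback such as a numeric ID.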

bdonlan 2009-05-05 01:27:51

Thanks, I did not know about iconv's ability to "transliterate". However, in the context of web applications, iconv isn't suitable, let alone the fact that the site is running on Google App Engine.

bustrofedon 2009-05-05 08:56:56

Answer 3

+2 A:

I simply use UTF-8 for URL paths. As long as the domain is non-IDN, FF3 and IE work fine with this. Google reads and displays them correctly. The IRI RFC (RFC 3987) allows Unicode. Just make sure you parse the incoming URLs correctly.
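
As an illustration of parsing such URLs correctly, here is a minimal Python 3 sketch using the standard library's `urllib.parse`. It assumes the path segment arrives percent-encoded as UTF-8, which is how browsers normally send it; the slug value is just the example phrase from this thread.

```python
from urllib.parse import quote, unquote

slug = "lágrima-cayó-en-la-arena"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character;
# hyphens and ASCII letters are left untouched.
encoded = quote(slug)
print(encoded)  # l%C3%A1grima-cay%C3%B3-en-la-arena

# unquote() decodes the percent-escapes back to the original Unicode text.
decoded = unquote(encoded)
print(decoded == slug)  # True
```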

felixg 2009-05-05 09:58:39

Sure, but how do you write a RegEx for Unicode characters? [-\w] won't match **lágrima-cayó-en-la-arena**

kRON 2009-05-06 11:00:13

I use .NET, and it supports Unicode in RegEx. For JavaScript, check jquery.validate and http://www.ibm.com/developerworks/web/library/wa-uri/index.html
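
For Python, which the original asker uses, a minimal sketch of the same idea: in Python 3, `\w` in a `str` pattern already matches Unicode word characters, while the `re.ASCII` flag restores ASCII-only behaviour. (In Python 2, the `re.UNICODE` flag with `unicode` strings was needed instead.) The pattern names below are hypothetical.

```python
import re

# In Python 3, \w matches Unicode letters and digits by default for str patterns.
UNICODE_SLUG_RE = re.compile(r"[-\w]+")

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters fail.
ASCII_SLUG_RE = re.compile(r"[-\w]+", re.ASCII)

slug = "lágrima-cayó-en-la-arena"

print(UNICODE_SLUG_RE.fullmatch(slug) is not None)  # True
print(ASCII_SLUG_RE.fullmatch(slug) is not None)    # False
```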

felixg 2009-05-06 16:52:50

Answer 4

+5 A:

A nearly complete transliteration table (for Latin, Greek and Cyrillic character sets) can be found in the slughifi library. It is geared towards Django, but can be easily modified to fit general needs (I use it with a Werkzeug-based app on App Engine).
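
To show the general shape of a table-driven approach, here is a minimal Python 3 sketch. The table is a tiny hypothetical sample made up for illustration, not slughifi's actual data or API; a real table would cover far more characters.

```python
# Hypothetical sample table: a handful of Cyrillic and Greek letters only.
SAMPLE_TABLE = str.maketrans({
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "ж": "zh",  # one source character may map to several ASCII letters
    "у": "u", "к": "k",
    "α": "a", "β": "b", "γ": "g",
})


def transliterate(text):
    """Lowercase the text, then replace each character found in the table."""
    return text.lower().translate(SAMPLE_TABLE)


print(transliterate("Привет"))  # privet
print(transliterate("жук"))     # zhuk
print(transliterate("αβγ"))     # abg
```

Characters missing from the table pass through unchanged, so a slug function built on this would still need a final step that removes anything outside the allowed set.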

zgoda 2009-05-05 13:21:26

Thanks zgoda, in my situation (Python, App Engine), this beefed-up slugify will do the job.

bustrofedon 2009-05-05 15:04:44


rules for slugs and unicode
