views: 334
answers: 4

Hello, can anyone tell me where I can find a translation table for the letters of all the world's languages, including Russian, Greek, Thai, etc.? I need a function that creates a fancy URL from text in any language. And since we know nothing about, for example, Japanese, I am trying it this way. Thanks for your replies.

A: 

I'm not sure I understood your question correctly. Are you looking for something like this?

http://www.joelonsoftware.com/articles/Unicode.html

Carlos Lima
A: 

You can always try to convert the text to ISO-8859-1 (easily with iconv, for example, if you're using PHP) and then simply replace spaces and all those characters that are valid in ISO-8859-1 but not in a URL ;-)
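
A minimal sketch of that idea in Python (the function name and the hyphen replacement are my own choices, not from the answer):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

def slugify_latin1(text):
  # 'ignore' silently drops anything outside ISO-8859-1, which is
  # exactly the limitation raised in the comment below.
  latin1 = text.encode('iso-8859-1', 'ignore')
  # Replace spaces and other non-URL-safe characters with hyphens.
  return re.sub(r'[^A-Za-z0-9]+', '-', latin1).strip('-')

print slugify_latin1(u"Olá mundo!")    # Ol-mundo (the accent is lost)
print slugify_latin1(u"Привет, мир!")  # '' (nothing survives Latin-1)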

RomanT
No, you can't: the question states that the input will come from arbitrary scripts, i.e., the text will contain codepoints that Latin-1 cannot encode.
pat
+3  A: 

Sounds like what you want is a transliteration table. Try some of the links on that page. If you want it only for HTTP URLs, have a look at percent-encoding.
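
For the percent-encoding route, Python's standard library already covers it (a quick sketch, not from the answer itself):

# -*- coding: utf-8 -*-
# Every UTF-8 byte outside the unreserved set becomes %XX, so text in
# any script survives the trip into a URL.
import urllib

print urllib.quote(u"Привет, мир!".encode('utf-8'))
# %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%2C%20%D0%BC%D0%B8%D1%80%21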

l0b0
+2  A: 

Transliteration in general is non-trivial; see the Unicode Transliteration Guidelines. The answer to your question, bluntly, is that the table you're looking for doesn't exist.

That said, there are a few workarounds available, like Sean M. Burke's Unidecode Perl module (and its ports to Ruby and Python). But as he points out, you're not going to get a transliteration of, say, Thai or Japanese that's usefully readable from such a conversion.

Take a look at the following test session using the Python port:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from unidecode import unidecode

hello = u"""Hello world! English 
Salut le monde! French 
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Tere, maailm! Estonian
Merhaba dünya! Turkish 
Olá mundo! Portuguese
안녕, 세상! Korean
你好,世界! Chinese
こんにちは 世界! Japanese
ሠላም ዓለም! Amharic
哈佬世界! Cantonese
Привет, мир! Russian
Καλημέρα κόσμε! Greek
สวัสดีราคาถูก! Thai"""

lines = hello.splitlines()
samples = []

for line in lines:
  # The language name is the last word on each line; the greeting is
  # everything before it.
  words = line.split()
  language, text = words[-1], ' '.join(words[:-1])
  samples.append((language, text))

for language, text in samples:
  # Show the original text next to its ASCII transliteration.
  print language.upper()
  print text
  print unidecode(text)
  print

Which outputs:

ENGLISH
Hello world!
Hello world!

FRENCH
Salut le monde!
Salut le monde!

ESPERANTO
Saluton Mondo!
Saluton Mondo!

LATVIAN
Sveika, pasaule!
Sveika, pasaule!

ESTONIAN
Tere, maailm!
Tere, maailm!

TURKISH
Merhaba dünya!
Merhaba dunya!

PORTUGUESE
Olá mundo!
Ola mundo!

KOREAN
안녕, 세상!
annyeong, sesang!

CHINESE
你好,世界!
Ni Hao ,Shi Jie !

JAPANESE
こんにちは 世界!
konnitiha Shi Jie !

AMHARIC
ሠላም ዓለም!
szalaame `aalame!

CANTONESE
哈佬世界!
Ha Lao Shi Jie !

RUSSIAN
Привет, мир!
Priviet, mir!

GREEK
Καλημέρα κόσμε!
Kalemera kosme!

THAI
สวัสดีราคาถูก!
swasdiiraakhaathuuk!

For languages that are Latin-ish in the first place, it's quite useful: it strips accent marks. Outside of those, things get dicey fast.
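
(An aside, not part of Unidecode itself: if accent-stripping for Latin scripts is all you need, the standard library can do that part on its own; the function name here is an arbitrary choice.)

# -*- coding: utf-8 -*-
# Decompose with NFD, then drop the combining marks (category 'Mn').
import unicodedata

def strip_accents(text):
  decomposed = unicodedata.normalize('NFD', text)
  return u''.join(c for c in decomposed
                  if unicodedata.category(c) != 'Mn')

print strip_accents(u"Merhaba dünya!")  # Merhaba dunya!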

If you compare the Chinese and Japanese examples, you'll see that the sequence 世界 is transliterated Shi Jie in both. That's wrong: the transliteration (or better, the "reading") of the Japanese should be sekai. The Russian and Greek are not too bad, but the Amharic and Thai are abysmal; I would guess that they're not even legible to someone who's fluent in those languages.

The general problem here is that transliteration can't be defined without taking language-specific information into account, and even determining the language is non-trivial: how is your program supposed to know whether 世界 is Japanese or Chinese?

A better policy than trying to force hackish transliteration into your application is to figure out how to support Unicode properly in the first place. If you have to have an all-ASCII representation of non-Latin-script text, use URL encoding.
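
A sketch of that policy (the function name and separator pattern are arbitrary choices): keep the slug in Unicode and leave the escaping to percent-encoding instead of transliterating.

# -*- coding: utf-8 -*-
import re
import urllib

SEPARATORS = re.compile(ur'[\s!?,.]+', re.UNICODE)

def unicode_slug(text):
  # Collapse whitespace and punctuation to hyphens, but keep non-Latin
  # letters intact; percent-encoding handles the URL representation.
  slug = SEPARATORS.sub(u'-', text).strip(u'-')
  return urllib.quote(slug.encode('utf-8'))

print unicode_slug(u"Привет, мир!")
# %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82-%D0%BC%D0%B8%D1%80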

pat