I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).

This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "Подражанская". I then have to follow up with the customer and ask him for a "Latin character version" of his name in order to issue a registration code.

So, is there any Perl module that can detect whether a name contains such characters and, if necessary, automatically translate them to their closest ASCII representation?

It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

A: 

If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji characters?

innaM
In over 10 years of shareware development, I have only had a handful of customers from Japan and China. Unicode-enabling all of my shareware programs just to take care of a mild annoyance would be overkill. I am looking for more of a quick and dirty approach in this case.
Adrian Grigore
So maybe (just maybe), you might find a lot more customers if you enabled UTF-8?
innaM
A few: yes. A lot and worth the time of development: no. Piracy is a very big issue in the shareware business, especially in countries like China. The Japanese market is not bad, but from what I have heard from other shareware authors it usually is not worth it unless you have a really big title.
Adrian Grigore
+10  A: 

I believe you could use Text::Unidecode for this; it is precisely what that module tries to do.
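A minimal sketch of how this might look, assuming the incoming name arrives as raw UTF-8 bytes. The helper name `ascii_name` and the sample input are made up for illustration; only `Encode::decode`, `Encode::encode` and `Text::Unidecode::unidecode` are real library calls. The result is a rough, character-by-character approximation, not a proper transcription.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Text::Unidecode;    # exports unidecode() by default

# Hypothetical helper (not part of any module): turn raw UTF-8 bytes
# into a plain-ASCII approximation of the name.
sub ascii_name {
    my ($bytes) = @_;

    # Bytes -> Perl character string. With the default CHECK value,
    # malformed sequences are replaced by U+FFFD instead of dying.
    my $name = decode('UTF-8', $bytes);

    # Already pure ASCII: nothing to transliterate.
    return $name unless $name =~ /[^\x00-\x7F]/;

    # Context-free approximation, one character at a time.
    return unidecode($name);
}

# Sample input: a Cyrillic surname, encoded to UTF-8 bytes for the demo.
my $raw = encode('UTF-8',
    "\x{41F}\x{43E}\x{434}\x{440}\x{430}\x{436}\x{430}\x{43D}\x{441}\x{43A}\x{430}\x{44F}");

print ascii_name($raw), "\n";
```

The non-ASCII check is what answers the "detect whether there are such characters" part of the question: names that are already ASCII pass through unchanged, and everything else is handed to `unidecode`. It is probably worth logging the original bytes alongside the generated ASCII name, since the approximation can differ from the spelling the customer would choose.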

mirod
Just what I was looking for - Thanks! :-)
Adrian Grigore
A: 

If you get Cyrillic text, there is no "closest ASCII representation" for many characters.

Nemanja Trifunovic
+1. Transliteration is not a simple business of substituting single characters. Either support Unicode properly or only support ASCII; anything in between gets messy quick.
bobince
Nevertheless, whenever I ask someone from Russia for his name, he is able to provide a Latin character version of it. I am aware that some characters are only rough approximations, but obviously there has to be a solution to my problem.
Adrian Grigore
Well, some names they give you as Latin equivalents aren't their "real" names.
brian d foy
What they give you is a way to pronounce their names, which is transcription, while you are looking for transliteration, which is a different problem.
Nemanja Trifunovic
I agree. If there were ASCII/Latin equivalents for these characters they wouldn't have had to invent Unicode in the first place.
AmbroseChapel
I was born in Romania, so I know that what they give me is not their real name. But "close enough" will do in this situation. Business-wise, it makes no sense to add Unicode support to accommodate less than 0.1% of my users. I might as well implement something more useful instead.
Adrian Grigore