Are there any solutions that will convert all foreign characters to their A-z equivalents? I have searched extensively on Google and could not find a solution, or even a list of characters and their equivalents. The reason is that I want to display A-z only URLs, and these characters cause plenty of other trip-ups as well.

A: 

The problem with your query is that this is a very hard thing to do. Not all glyphs in most languages have a-z equivalents. All glyphs have phonetic equivalents, but these are words, not letters. If you are just dealing with Latin-based languages then things are a little easier, but you still have issues with things like i-mutation.

Your best solution would be to come up with a crude list of phonetic sounds -> a-z equivalents. It won't be perfect, but without more information on your exact requirements it is hard to develop a solution.

Jamie Lewis
I am mostly dealing with European languages, so a rough solution would be fine. I once found a big list in the source of another script, but have completely lost it.
esryl
+1  A: 

The strtr manual page lists a few possibilities in the comments, such as

function normalize ($string) {
    $table = array(
        'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
        'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
        'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
        'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
        'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
        'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
    );

    return strtr($string, $table);
}
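
Since the question is about A-z only URLs, a lookup table like this can be combined with a clean-up step to build a URL slug. The following is only a sketch and not part of the original answer: the `slugify()` helper and the tiny `$table` are hypothetical, and a real table would need far more entries.

```php
<?php
// Hypothetical helper: slugify() is an illustration, not part of the original answer.
function slugify(string $text, array $table): string
{
    // Replace each accented character using the lookup table.
    $ascii = strtr($text, $table);

    // Lowercase, then collapse every run of non-alphanumeric bytes into one hyphen.
    $slug = preg_replace('~[^a-z0-9]+~', '-', strtolower($ascii));

    // Drop hyphens left over at either end.
    return trim($slug, '-');
}

// A deliberately tiny table for illustration only.
$table = ['è' => 'e', 'û' => 'u', 'é' => 'e'];

echo slugify('Crème Brûlée!', $table), "\n"; // prints "creme-brulee"
```

Any character missing from the table is treated as a separator and becomes a hyphen, so gaps in the table show up as broken words rather than as non-ASCII output.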
Sebastian P.
Thank you, that is a good start, but it is still missing characters I need, like ü. I think I need to go through each alphabet one at a time.
esryl
The problem with the above solution is that the replacements are not equivalent: æ is closer to the phonetic 'ay' or the English 'ae'. It really depends on the original poster's needs, i.e. whether he wants an English phonetic equivalent or a 'vanity' translation.
Jamie Lewis
Hi Jamie, yeah, I noticed the ae not being done correctly in that, so I am currently compiling my own list on a per-language basis. At least with my newfound mb understanding I can do simple replaces. I was hoping someone somewhere would have this monster list of letters and equivalents already completed.
esryl
Take a look at my answer; it solves this problem quite nicely: 'æ' comes up as 'ae'.
Alix Axel
For Turkish: 'ü'=>'u', 'Ü'=>'U', 'ğ'=>'g', 'Ğ'=>'G', 'ş'=>'s', 'Ş'=>'S'
nerkn
Some additions: 'ı'=>'i', 'İ'=>'I'
nerkn
+5  A: 

You can use iconv, which has a special transliteration mode.

When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several characters that look similar to the original character.

-- http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html

See here for a complete example that matches your use case.
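
The transliteration mode quoted above can be tried with a few lines of PHP. This is a minimal sketch rather than the linked example: the `to_ascii()` helper name is hypothetical, and the exact replacement characters depend on the iconv implementation and the current locale, so no particular output is promised.

```php
<?php
// Hypothetical helper name; iconv() itself is a standard PHP function.
function to_ascii(string $utf8): string
{
    // //TRANSLIT asks iconv to approximate characters that ASCII cannot represent.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    // iconv() returns false on failure; fall back to the input unchanged.
    return $ascii === false ? $utf8 : $ascii;
}

// The exact result depends on the iconv implementation and the current locale.
echo to_ascii('Crème brûlée'), "\n";
```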

troelskn
I had just stumbled across iconv as my research continued. Thank you very much for linking me to the complete example.
esryl
A: 

If you don't have access to iconv and don't want to use long lookup tables like the one Sebastian P. suggested, you can take advantage of the HTML entity representation of each character, like this:

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
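
One thing to watch with this one-liner is that htmlentities() also encodes characters such as `&`, and the regex leaves those entities in place. A possible way to handle that, sketched here as an assumption rather than as part of the original snippet, is to decode whatever remains afterwards. The `unaccent()` wrapper name is hypothetical.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function unaccent(string $string): string
{
    // Turn accented characters into named entities, e.g. "é" into "&eacute;".
    $entities = htmlentities($string, ENT_COMPAT, 'UTF-8');

    // Keep only the base letter(s) of each accent-style entity.
    $stripped = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $entities
    );

    // Decode the entities the regex did not touch (such as &amp;) back to characters.
    return html_entity_decode($stripped, ENT_COMPAT, 'UTF-8');
}

echo unaccent('Crème brûlée & æther'), "\n"; // prints "Creme brulee & aether"
```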

As far as I'm aware, it doesn't work for Chinese, Japanese and other exotic charsets, but it works just fine for the accented Latin characters that have named HTML entities.

Alix Axel
+2  A: 

If you are using iconv, make sure your locale is set correctly before you try the transliteration; otherwise some characters will not be transliterated correctly:

setlocale(LC_CTYPE, 'en_US.UTF8');
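
Locale names differ between systems, and setlocale() returns false when the requested locale is not installed, so it may be worth checking the result. The sketch below passes several candidate names; which of them exist on a given host is an assumption, and the candidate list is only illustrative.

```php
<?php
// setlocale() tries each name in order and returns the one it applied, or false.
// The candidate names are assumptions about what the host system provides.
$applied = setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.UTF8', 'C.UTF-8');

if ($applied === false) {
    // No candidate was available, so transliteration results may be poor.
    error_log('No UTF-8 locale available; iconv transliteration may be incomplete.');
}
```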
Shane O'Grady