Some letters in different alphabets look exactly the same.

Like A in Latin and А in Cyrillic.

Are they treated as the same character when I use one of them in a UTF-8 script?

If not, how can I find out the code of a given letter?

+6  A: 

It's not clear what you mean by "play the same role".

They are certainly not the same character, though they may appear to be when rendered.

This is exactly analogous to the confusion between "l" (lowercase L) and "I" (uppercase i) in many fonts.

If you want to consider A and А to be the same, you have to transliterate the Cyrillic character into a Latin one. Unfortunately, PHP support for transliteration is sketchy. You can use iconv, which is not great: if you transliterate to ASCII, you'll lose everything that cannot be represented in ASCII.

The Unicode PHP implementation (what was supposed to be PHP 6) had a function called str_transliterate that used the ICU transliteration API. Hopefully, transliteration will be added to the intl extension (the current ICU wrapper) in the future.

Artefacto
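A rough sketch of the transliteration idea from the answer above, not a definitive implementation. The helper name `to_latin` is hypothetical. It assumes PHP 7 or later and tries the `Transliterator` class that the intl extension has shipped since PHP 5.4 (which may not be installed), then falls back to `iconv` with `//TRANSLIT`. What `iconv` returns depends on the platform's iconv library and locale, so no expected output is shown.

```php
<?php
// Hypothetical helper: transliterate Cyrillic text to Latin where possible.
// Returns null when neither intl nor iconv can do the conversion.
function to_latin(string $text): ?string
{
    // Preferred path: ICU transliteration via the intl extension, if loaded.
    if (class_exists('Transliterator')) {
        $tr = Transliterator::create('Cyrillic-Latin');
        if ($tr !== null) {
            $out = $tr->transliterate($text);
            if ($out !== false) {
                return $out;
            }
        }
    }

    // Fallback: iconv. Characters with no ASCII equivalent may be replaced
    // or dropped, and the exact result varies between iconv implementations.
    if (function_exists('iconv')) {
        $out = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
        if ($out !== false) {
            return $out;
        }
    }

    return null;
}

$cyrillicA = "\u{0410}"; // Cyrillic capital A, written as an escape to avoid ambiguity
var_dump(to_latin($cyrillicA));
```

If the result equals the Latin 'A', the two letters can then be compared as equal after transliteration.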
+1  A: 

They're certainly not the same. PHP doesn't use eyes or OCR to determine what letter a character is.

$latinA = 'A';
$cyrillicA = 'А';

var_dump($latinA == $cyrillicA); // bool(false)
BoltClock
You cannot use `ord` on the Cyrillic character. It's composed of two bytes, so you're getting the leading byte only.
Artefacto
@Artefacto: good catch, didn't know `ord()` isn't multibyte-compatible.
BoltClock
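To illustrate the point about `ord()` and to show one way to get the code of a letter, here is a minimal sketch. The helper `utf8_code_point` is hypothetical; it assumes PHP 7 or later and a non-empty, valid UTF-8 string, and does no validation. Where the mbstring extension is available (PHP 7.2+), `mb_ord()` should give the same number without a hand-written decoder.

```php
<?php
// Hypothetical helper: decode the first UTF-8 character of $s into its
// Unicode code point. Assumes $s is non-empty, valid UTF-8.
function utf8_code_point(string $s): int
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0;                  // 1 byte:  0xxxxxxx
    }
    if ($b0 < 0xE0) {
        $len = 2; $cp = $b0 & 0x1F;  // 2 bytes: 110xxxxx
    } elseif ($b0 < 0xF0) {
        $len = 3; $cp = $b0 & 0x0F;  // 3 bytes: 1110xxxx
    } else {
        $len = 4; $cp = $b0 & 0x07;  // 4 bytes: 11110xxx
    }
    // Each continuation byte (10xxxxxx) contributes six more bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($s[$i]) & 0x3F);
    }
    return $cp;
}

$latinA    = "A";
$cyrillicA = "\u{0410}"; // Cyrillic capital A, written as an escape to avoid ambiguity

printf("%s U+%04X\n", bin2hex($latinA), utf8_code_point($latinA));       // 41 U+0041
printf("%s U+%04X\n", bin2hex($cyrillicA), utf8_code_point($cyrillicA)); // d090 U+0410
printf("%d\n", ord($cyrillicA)); // 208, i.e. only the leading byte 0xD0
```

The Latin letter is a single byte, while the Cyrillic one is the two-byte sequence D0 90, which is why a plain `==` comparison reports them as different.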
+1  A: 

You might be interested in the 'spoof detection' API in ICU. I think it is designed to report that your two As are 'visually confusable'.

Steven R. Loomis
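In PHP, ICU's spoof detection is exposed through the `Spoofchecker` class of the intl extension. The following is a small sketch, assuming PHP 7 or later; the wrapper `look_confusable` is a hypothetical name, and it returns null when intl is not installed. No output is shown because the result depends on the ICU version and its confusables data.

```php
<?php
// Hypothetical wrapper around intl's Spoofchecker.
// Returns true/false from ICU, or null when the intl extension is missing.
function look_confusable(string $a, string $b): ?bool
{
    if (!class_exists('Spoofchecker')) {
        return null;
    }
    $checker = new Spoofchecker();
    return $checker->areConfusable($a, $b);
}

$latinA    = "A";
$cyrillicA = "\u{0410}"; // Cyrillic capital A

var_dump(look_confusable($latinA, $cyrillicA));
```

This only reports visual similarity. It does not make the two strings compare as equal.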