ansaurus

Question

Answer 1

A:

See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

Each line describes a Unicode character, for example:

1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;

If there are any similar (compatible) characters for that symbol, they appear in the <compat> field of the entry (the sixth semicolon-separated field). In this example, 0061 (ASCII a) is part of the compatibility mapping for the LATIN SMALL LETTER A WITH RIGHT HALF RING Unicode character.
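To make the field layout concrete, here is a rough Python sketch that splits one UnicodeData.txt entry and pulls out the sixth field. The function name `parse_unicode_data_line` and the returned dictionary keys are made up for illustration; only the semicolon-separated layout comes from the file itself.

```python
def parse_unicode_data_line(line):
    """Split one UnicodeData.txt entry and extract its decomposition field.

    Hypothetical helper for illustration; not part of any library.
    """
    fields = line.strip().split(";")
    parts = fields[5].split()  # sixth field: optional <tag> plus hex code points

    tag = None
    if parts and parts[0].startswith("<"):
        # A leading <...> tag (such as <compat>) marks a compatibility mapping.
        tag = parts[0].strip("<>")
        parts = parts[1:]

    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "tag": tag,
        "mapping": [int(p, 16) for p in parts],
    }


entry = parse_unicode_data_line(
    "1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;"
)
print(entry["tag"], [hex(cp) for cp in entry["mapping"]])
```

An empty sixth field yields `tag` of `None` and an empty `mapping` list, which is how a character without a compatibility mapping shows up.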

As for your character, the entry is

0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405

which, as you can see, does not specify a compatibility character.
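If you would rather not download and parse the file, Python's standard `unicodedata` module exposes the same field. The sketch below compares the two characters discussed here; the helper name `compatibility_info` is only illustrative, and the values depend on the Unicode version bundled with your Python.

```python
import unicodedata


def compatibility_info(ch):
    """Return the name, raw decomposition field, and NFKD form of one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        # Same content as the sixth field of UnicodeData.txt ("" if absent).
        "decomposition": unicodedata.decomposition(ch),
        # Compatibility decomposition applied to the character.
        "nfkd": unicodedata.normalize("NFKD", ch),
    }


for ch in ("\u1e9a", "\u0455"):
    info = compatibility_info(ch)
    print(f"U+{ord(ch):04X}", info["name"], repr(info["decomposition"]))
```

For U+0455 the decomposition string is empty and the NFKD form is the character itself, so normalisation alone will never turn it into an ASCII s.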

adamk 2010-08-04 09:07:06

The compatibility field describes a sequence of characters that'd mean the same thing as the character in question. In your example, the compatible sequence would be `U+0061` (the letter 'a') followed by `U+02BE` (the 'right half ring' modifier). For characters from different alphabets, it'd be pretty unusual for there to be compatibility sequences -- and that'd make what the OP is trying to do impossible without more info.

cHao 2010-08-04 11:38:04

The OP stated 'similar to ASCII characters', not exact. If you're looking for an 'a' with a right half ring, you could settle for an ASCII 'a' if there's nothing else available.

adamk 2010-08-04 12:10:43

Agreed -- in that case. But if you're looking for an ASCII char similar to a Cyrillic ѕ, which is the very example the OP used, that won't work.

cHao 2010-08-04 12:35:53

@cHao: You're right - as I stated in my answer, for the specific character the OP requested, the compatibility characters aren't a useful method.

adamk 2010-08-04 13:31:02

Answer 2

+4 A:

As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)

If I were you, to spare you the tedious work of assembling a list of characters yourself, I'd search for resources on homograph attacks: this is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also -- and that may be what you most need -- a "confusables" table. Here's another article with mainly punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
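As a rough idea of how such a table could be used, here is a Python sketch. It assumes the "source ; target ; type # comment" line layout of the published confusables data file; the three sample lines are hand-written for this example rather than copied from the file, and the function names are hypothetical. It is a plain per-character lookup, not the full skeleton algorithm described in the security report.

```python
# Hand-written sample lines in the assumed "source ; target ; type # comment"
# layout. A real check should load the complete published table instead.
SAMPLE_CONFUSABLES = """\
0455 ; 0073 ; MA # CYRILLIC SMALL LETTER DZE -> LATIN SMALL LETTER S
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0440 ; 0070 ; MA # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
"""


def load_confusables(text):
    """Build a {source string: target string} dict from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def ascii_lookalike(text, table):
    """Replace each character that has a table entry with its look-alike."""
    return "".join(table.get(ch, ch) for ch in text)


def looks_like_spoof(text, table):
    """True if the text changes under the mapping and the result is pure ASCII."""
    mapped = ascii_lookalike(text, table)
    return mapped != text and mapped.isascii()


table = load_confusables(SAMPLE_CONFUSABLES)
print(ascii_lookalike("\u0440\u0430\u0455\u0455", table))
print(looks_like_spoof("pa\u0455\u0455", table))
```

A check like `looks_like_spoof` only flags strings that collapse entirely to ASCII; mixed-script names that stay partly non-ASCII would need a stricter rule.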

What I do hope is that you aren't asking the question to construct such an attack.

chryss 2010-08-04 19:08:56

Thanks for all the good links and explanations. I actually try to protect against such attacks. :-) And I guess I will find some further stuff with the keyword "homograph attack".

DrDol 2010-08-04 22:34:33

That is good to hear :) . Yeah, that's the keyword you need! I edited a link (it pointed to an obsolete version).

chryss 2010-08-04 22:40:17

Find similar ASCII character in Unicode
