Does anyone know an easy way to find Unicode characters that look similar to ASCII characters? An example is "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for such characters. By similar I mean visually indistinguishable to a human reader: you can't see a difference by looking at it.

A: 

See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

Each line describes a Unicode character, for example:

1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;

If there are any similar (compatible) characters for that symbol, they will appear in the decomposition field of the entry, marked with <compat>. In this example, 0061 (ASCII a) is listed as compatible with the LATIN SMALL LETTER A WITH RIGHT HALF RING Unicode character.

As for your character, the entry is

0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405

which, as you can see, does not specify a compatibility character.
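A minimal sketch of reading that field in Python, assuming the usual semicolon-separated layout of UnicodeData.txt in which the decomposition is the sixth field. The function name `decomposition_of` is hypothetical, and the two input lines are the entries quoted above:

```python
def decomposition_of(line):
    """Parse one UnicodeData.txt line.

    Returns (character, tag, code_points), where tag is e.g. "<compat>"
    or None, and code_points is the list of integers in the
    decomposition field (empty if the field is empty).
    """
    fields = line.strip().split(";")
    char = chr(int(fields[0], 16))
    raw = fields[5]  # sixth field: decomposition mapping
    if not raw:
        return char, None, []
    parts = raw.split()
    tag = None
    if parts[0].startswith("<"):
        # A leading <...> tag marks a compatibility decomposition.
        tag, parts = parts[0], parts[1:]
    return char, tag, [int(p, 16) for p in parts]


RING_A = "1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;"
DZE = "0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405"

print(decomposition_of(RING_A)[1:])  # ('<compat>', [97, 702])
print(decomposition_of(DZE)[1:])     # (None, [])
```

The empty result for the second line is the same observation as above: the entry for ѕ carries no decomposition, so this field alone cannot map it to an ASCII letter.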

adamk
The compatibility field describes a sequence of characters that'd mean the same thing as the character in question. In your example, the compatible sequence would be `U+0061` (the letter 'a') followed by `U+02BE` (the 'right half ring' modifier). For characters from different alphabets, it'd be pretty unusual for there to be compatibility sequences -- and that'd make what the OP is trying to do impossible without more info.
cHao
The OP stated 'similar to ASCII characters', not exact. If you're looking for an 'a' with a right half ring, you could settle for an ASCII 'a' if there's nothing else available.
adamk
Agreed -- in that case. But if you're looking for an ASCII char similar to a Cyrillic ѕ, which is the very example the OP used, that won't work.
cHao
@cHao: You're right - as I stated in my answer, for the specific character the OP requested, the compatibility characters aren't a useful method.
adamk
+4  A: 

As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)

If I were you, to spare yourself the tedious work of assembling a list of characters by hand, I'd search for resources on homograph attacks. This is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also -- and that may be what you most need -- a "confusables" table. Here's another article, mainly about punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
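As a rough illustration of how such a table could drive the search and replace, here is a minimal Python sketch. It assumes the table is a text file of the form `source ; target ; type # comment`, with code points in hex, which is how the Unicode confusables data is laid out. The function names and the two sample lines are hypothetical stand-ins written in that format; in practice you would feed in the lines of the real file instead.

```python
def parse_confusables(lines):
    """Build a {source_string: target_string} map from confusables-style lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue  # blank, comment-only, or malformed line
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def to_ascii_lookalikes(text, table):
    """Replace each character whose listed look-alike is pure ASCII."""
    out = []
    for ch in text:
        repl = table.get(ch)
        out.append(repl if repl is not None and repl.isascii() else ch)
    return "".join(out)


# Illustrative sample only; not the real data file.
SAMPLE = [
    "# sample lines in the assumed format",
    "0455 ;\t0073 ;\tMA\t# CYRILLIC SMALL LETTER DZE -> LATIN SMALL LETTER S",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
]

table = parse_confusables(SAMPLE)
print(to_ascii_lookalikes("pa\u0455\u0455word", table))  # password
```

Only replacements that land in ASCII are applied here, since the question is about ASCII look-alikes; characters without an entry pass through unchanged. To detect a spoofed string rather than repair it, comparing the input with the output of `to_ascii_lookalikes` would be enough.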

What I do hope is that you aren't asking the question to construct such an attack.

chryss
Thanks for all the good links and explanations. I'm actually trying to protect against such attacks. :-) And I guess I'll find more material with the keyword "homograph attack".
DrDol
That is good to hear :) . Yeah, that's the keyword you need! I edited a link (it pointed to an obsolete version).
chryss