I believe there is an algorithm that can treat two strings as equal when their characters look alike but are different symbols (digits, Cyrillic, Latin, or other alphabets). For example:

  • "hello" (Latin symbols) equals "he11o" (digits and Latin symbols)
  • "HELLO" (Latin symbols) equals "НЕLLО" (Cyrillic and Latin symbols)
  • "really" (Latin symbols) equals "геа11у" (digits and Cyrillic symbols)
+1  A: 

I am not exactly sure what you are asking for.

If you want to know whether two characters look the same in a given typeface, then you need to render each character in the chosen font into a bitmap and compare the bitmaps to see whether they are close to identical.

If you just want to always consider the lower-case Latin 'l' to be the same as the digit '1' regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this is to pick a canonical value for each set of characters that look the same and map all members of the set to that character. When you compare the strings, compare the canonical form of each character from the table.
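A minimal sketch of the mapping-table idea in Python, covering only the characters from the question's examples. The `LOOKALIKES` table and the names `canonical` and `looks_equal` are made up for illustration; a real table would need many more entries and is a judgment call about which glyphs count as "the same".

```python
# Hypothetical, hand-picked table: canonical character -> its look-alikes.
# Cyrillic letters are written as escapes so they cannot be mistaken
# for the Latin letters they resemble.
LOOKALIKES = {
    "l": "1",             # digit one
    "o": "0\u043e",       # digit zero, Cyrillic small o
    "H": "\u041d",        # Cyrillic capital en
    "E": "\u0415",        # Cyrillic capital ie
    "O": "\u041e",        # Cyrillic capital o
    "r": "\u0433",        # Cyrillic small ghe
    "e": "\u0435",        # Cyrillic small ie
    "a": "\u0430",        # Cyrillic small a
    "y": "\u0443",        # Cyrillic small u
}

# Invert the table: every look-alike maps to its canonical character.
TABLE = str.maketrans(
    {variant: canon
     for canon, variants in LOOKALIKES.items()
     for variant in variants}
)


def canonical(text):
    """Replace each known look-alike with its canonical character."""
    return text.translate(TABLE)


def looks_equal(a, b):
    """Compare two strings by their canonical forms."""
    return canonical(a) == canonical(b)


print(looks_equal("hello", "he11o"))
print(looks_equal("HELLO", "\u041d\u0415LL\u041e"))
print(looks_equal("really", "\u0433\u0435\u043011\u0443"))
```

Characters that are not in the table pass through unchanged, so ordinary strings compare as usual.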

Christopher Barber
Thanks for the reply. It can be done that way. But maybe such an algorithm already exists, something like the phonetic algorithms Metaphone, Soundex, etc.
cubanacan
+1  A: 

You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.

Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.

Daniel Trebbien
Thanks for the useful answer. For the first example (digits and Latin symbols) there is [The Code and The Algorithm](http://hissa.nist.gov/~black/GTLD/) (source code in Python).
cubanacan