I'm filtering messages on a chat system where constraining strings to Latin-1 English is desirable. Users tend to type creatively, e.g.
ßòógīě§
instead of
Boogies
In Java, there are Unicode normalization methods that can remove diacritical marks, but I'm more interested in ways of normalizing the shapes of the letters towards English and the Latin-1 character set.
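For reference, the diacritic removal I mean looks roughly like the sketch below: decompose with `java.text.Normalizer` (form NFD), then drop the combining marks. The class and method names are just placeholders of mine. The string literals use `\u` escapes for ò, ó, ī, ě, ß and § so the snippet does not depend on the source file encoding.

```java
import java.text.Normalizer;

class DiacriticStripper {
    // Decompose to NFD so accented letters split into a base letter plus
    // combining marks, then delete every combining mark (category M).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // o-grave, o-acute, g, i-macron, e-caron
        System.out.println(stripDiacritics("\u00F2\u00F3g\u012B\u011B")); // prints "oogie"
        // Sharp s and the section sign have no decomposition, so they survive.
        System.out.println(stripDiacritics("\u00DF\u00A7").equals("\u00DF\u00A7")); // prints "true"
    }
}
```

This handles the accents but leaves characters such as ß and § untouched, which is the part I'm asking about.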
Are there any tables, libraries or methods out there that can map common Unicode characters, including those outside Latin-1, to the English letters they most resemble visually? E.g.
ß -> B
§ -> S
¥ -> Y
¤ -> o
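To make the idea concrete, here is a minimal sketch of the kind of lookup I have in mind, assuming a hand-written table. `LookalikeMapper` and its map are hypothetical and contain only the four pairs above; a real table would need far more entries, which is exactly what I'd like to avoid writing by hand.

```java
import java.util.HashMap;
import java.util.Map;

class LookalikeMapper {
    // Hypothetical hand-picked table: only the four example pairs, keyed by code point.
    private static final Map<Integer, String> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put(0x00DF, "B"); // sharp s
        LOOKALIKES.put(0x00A7, "S"); // section sign
        LOOKALIKES.put(0x00A5, "Y"); // yen sign
        LOOKALIKES.put(0x00A4, "o"); // currency sign
    }

    // Replace each code point found in the table; pass everything else through unchanged.
    static String mapLookalikes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String replacement = LOOKALIKES.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapLookalikes("\u00DF\u00A7\u00A5\u00A4")); // prints "BSYo"
    }
}
```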
I suspect that the answer is "No, this would be too big, just filter them all out instead", but I can hope...