I'm filtering chat messages on a chat system where constraining strings to Latin-1 English is desirable. Users tend to use creative typing, e.g.

ßòógīě§

instead of

Boogies

In Java, there are Unicode normalization methods which can remove diacritic marks, but I'm more interested in methods of normalizing the shapes of the letters towards English and the Latin-1 character set.
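
For reference, a minimal sketch of the diacritic-removal approach mentioned above, assuming `java.text.Normalizer` from the standard library is what is meant. The class and method names (`DiacriticStripper`, `stripDiacritics`) are hypothetical, chosen only for illustration.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    // Decompose to NFD so that an accented letter becomes a base letter
    // followed by combining marks, then delete the combining marks
    // (Unicode general category M).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The example string from above, written with Unicode escapes so the
        // source file does not depend on the compiler's encoding setting.
        String creative = "\u00DF\u00F2\u00F3g\u012B\u011B\u00A7";
        System.out.println(stripDiacritics(creative));
    }
}
```

This turns the accented vowels into plain `o`, `i` and `e`, but it leaves `ß` and `§` untouched because they have no canonical decomposition, which is exactly the gap I'm asking about.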

Are there any tables, libraries or methods out there that can map common Unicode characters outside Latin-1 to their visually nearest forms? E.g.

ß -> B
§ -> S
¥ -> Y
¤ -> o
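
To make the idea concrete, here is a rough sketch of what such a table lookup might look like in Java. The table holds only the four pairs listed above; the class name `LookalikeMapper`, the method `mapLookalikes` and the table itself are hypothetical, and a real table would need far more entries.

```java
import java.util.HashMap;
import java.util.Map;

final class LookalikeMapper {
    // Hypothetical hand-written table: code point -> visually similar ASCII text.
    // It contains only the four example pairs from the question.
    private static final Map<Integer, String> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put(0x00DF, "B"); // sharp s
        LOOKALIKES.put(0x00A7, "S"); // section sign
        LOOKALIKES.put(0x00A5, "Y"); // yen sign
        LOOKALIKES.put(0x00A4, "o"); // currency sign
    }

    // Replace every code point found in the table; copy all others unchanged.
    static String mapLookalikes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String replacement = LOOKALIKES.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Sharp s + "oogie" + section sign, written with Unicode escapes.
        System.out.println(mapLookalikes("\u00DFoogie\u00A7"));
    }
}
```

Iterating over code points rather than `char` values keeps the lookup correct for characters outside the Basic Multilingual Plane, should the table ever grow to include them.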

I suspect that the answer is "No, this would be too big, just filter them all out instead" but I can hope...

A: 

I think your best bet is to use an OCR (optical character recognition) engine. After all, that's precisely what you're after: a best effort to parse the letters into readable A-Z characters. (Remember to render the chat messages onto an image using the same font as your chat client uses.)

Two Java-OCR libraries:

aioobe
A: 

The correct solution is not to install idiotic "profanity filters" (which I assume are behind this request). If the community cannot police itself at all in that regard, moderate it manually and ban offenders, or shut it down. Having to wrestle with the Scunthorpe problem will offend your users much more than some swearing kids.

Michael Borgwardt
Possibly, but it is possible to offend users by filtering, and parents of users by not filtering. In any case, the filtering is being done already, and this is not really an answer to the question posed. Understanding the shape of letter forms will lead to an understanding of the intent behind the message and ultimately to fewer messages being blocked.
izb