How have you like-minded individuals tackled the basic challenge of filtering profanity? Obviously one can't possibly handle every scenario, but it would be nice to have a basic filter as a first line of defense.

In Obj-C I've got:

NSArray *tokens = [text componentsSeparatedByString:@" "];

And then I loop through each token to see if any of the keywords (I've got about 400 in a list) are found within each token.

Realising that false positives are also a problem, I use two rules: if a token is a perfect match for a keyword, the text is flagged as profanity; otherwise, if more than 3 tokens contain a keyword without being perfect matches, the text is also flagged as profanity.
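The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `TextLooksProfane` is hypothetical, the keyword list is assumed to be lowercase already, and "more than 3" is read as a count of tokens rather than of keywords.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES when the text should be flagged.
// Assumes `badWords` holds lowercase keywords.
static BOOL TextLooksProfane(NSString *text, NSArray *badWords) {
    NSSet *exact = [NSSet setWithArray:badWords];
    NSUInteger partialHits = 0;

    for (NSString *rawToken in [text componentsSeparatedByString:@" "]) {
        NSString *token = [rawToken lowercaseString];
        if ([token length] == 0) continue; // skip gaps left by double spaces

        // Rule 1: a perfect match flags the text immediately.
        if ([exact containsObject:token]) return YES;

        // Rule 2: count tokens that merely contain a keyword.
        for (NSString *bad in badWords) {
            if ([token rangeOfString:bad].location != NSNotFound) {
                partialHits++;
                break; // count each token at most once
            }
        }
    }
    // Flag only when more than 3 tokens contained a keyword.
    return partialHits > 3;
}
```

With a list containing "ass", a single "assume" is a partial hit and is not flagged on its own, while a token that is exactly "ass" is flagged straight away.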

Later on I will use a web service that tackles the problem more precisely, but for now I really just need something basic. So if you wrote the word "penis" it would go "yup, naughty naughty, bad word written".

+5  A: 

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

Jeff has an interesting article to consider before embarking on such a piece of code:

http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

Mark Harrison
Agreed, obscenity filters are a terrible idea.
Michael Aaron Safyan
In any normal case I would be all for avoiding a profanity filter; it seems like a waste of time to me too. BUT the client (aka the customer) has specifically requested it, with an emphasis on disliking profanity. I realise 100% that my solution is a 'bandaid', but I need something to ship that at least solves the most general case. Version 2.0 will use more realistic filtering, and a social reporting tool will be involved, aka 'report this user'. But until then a bandaid is what I need.
David van Dugteren
@David, that sounds reasonable.
Mark Harrison
+1  A: 

Well, searching in that manner is certainly not the most efficient way to search for profanity... a more efficient approach would be to construct a finite state automaton to detect the words, and run the text once through that FSA. You don't really need to split strings to find profanity, and all that splitting adds extra allocation and copying overhead that you don't need. Also, there may be common patterns in some of the blacklisted words, which you are not exploiting by searching each word individually.
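As a rough sketch of that idea, the blacklist can be compiled once into a trie (a prefix tree that shares common prefixes between words) and the text scanned against it without splitting. This is a simplification, not a full automaton: a complete FSA such as Aho-Corasick adds failure links so the text is read exactly once, whereas this version restarts the walk at each offset. The names `BuildTrie`, `FirstListedWord` and `kWordEnd` are hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical sentinel key marking "a listed word ends at this node".
static NSString * const kWordEnd = @"$end";

// Build a trie of nested dictionaries keyed by character (boxed as NSNumber).
static NSMutableDictionary *BuildTrie(NSArray *words) {
    NSMutableDictionary *root = [NSMutableDictionary dictionary];
    for (NSString *word in words) {
        NSString *lower = [word lowercaseString];
        NSMutableDictionary *node = root;
        for (NSUInteger i = 0; i < [lower length]; i++) {
            NSNumber *key = [NSNumber numberWithUnsignedShort:[lower characterAtIndex:i]];
            NSMutableDictionary *next = [node objectForKey:key];
            if (next == nil) {
                next = [NSMutableDictionary dictionary];
                [node setObject:next forKey:key];
            }
            node = next;
        }
        [node setObject:lower forKey:kWordEnd]; // remember which word ends here
    }
    return root;
}

// Try each start offset and follow trie edges as far as they go;
// returns the first listed word found anywhere in the text, or nil.
static NSString *FirstListedWord(NSString *text, NSDictionary *trie) {
    NSString *lower = [text lowercaseString];
    NSUInteger length = [lower length];
    for (NSUInteger start = 0; start < length; start++) {
        NSDictionary *node = trie;
        for (NSUInteger i = start; i < length; i++) {
            NSNumber *key = [NSNumber numberWithUnsignedShort:[lower characterAtIndex:i]];
            node = [node objectForKey:key];
            if (node == nil) break; // no listed word continues this way
            NSString *hit = [node objectForKey:kWordEnd];
            if (hit != nil) return hit;
        }
    }
    return nil;
}
```

Note that this matches listed words anywhere in the text, including inside longer words, so it still needs a word-boundary check or a false-positive rule on top.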

That said, I think 400 words is quite a lot. Who, exactly, is your audience? What if a user has a medical question? Should such questions actually be disallowed? I can only think of a handful of words that would be considered profane in any context, so you might want to rethink the filtering.

Michael Aaron Safyan
You're right, the list is tentative. I downloaded it off a forum, and the ultimate list will likely be cut down, but it's something that will be altered as time goes on. I want to use the list as a base/starting point before pitching it to the customer, who can ultimately decide what needs to be there and what doesn't. I'm looking into using an FSA/regex rather than a linear while loop. I'm just not that familiar with the iPhone SDK yet, so once I suss out the best way to do it, I'll go for a GREP approach.
David van Dugteren
I'm going ahead and using RegexKitLite; that should be a little more efficient.
David van Dugteren
+1  A: 

I just have a suggestion for tokenizing the string. Your way works well if the words are all separated by spaces, but that is rarely the case in most usage scenarios, as you would normally have to deal with newlines, punctuation, etc. Try this if you are interested:

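// Treat punctuation, whitespace and newlines as word separators.
NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];
[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

// Adjacent separators (e.g. ", ") produce empty strings in the result, so skip those when looping.
NSArray *words = [bigString componentsSeparatedByCharactersInSet:separators];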

Source: http://www.tech-recipes.com/rx/3418/cocoa-explode-break-nsstring-into-individual-words/

sosborn
Thank you so much, sosborn! I'm now using your code. I appreciate your help there!
David van Dugteren
@David, You should select this as the answer then, if you are using this as your solution.
Tim Jarvis
Well, it's not the answer; it's a handy tip to complement what I'm trying to achieve.
David van Dugteren
A: 

A couple of things:

  • FSA won't necessarily work depending on how intelligent you want the filter to be
  • Regexes are generally extremely slow, depending on how many you want to run
  • 400 words is somewhat low, depending on your needs and languages
  • There are a number of extremely tricky cases to be careful of when filtering, particularly embedding of words such as "ASSume"
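To illustrate the embedding case from the last bullet only: a whole-word check distinguishes "ass" from "ASSume". The sketch below assumes `NSRegularExpression` is available (iOS 4.0 or later), and the helper name `ContainsWholeWord` is hypothetical. It compiles one pattern per call, so it demonstrates the boundary problem rather than a fast way to run hundreds of words.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only when `word` appears as a whole word in `text`.
static BOOL ContainsWholeWord(NSString *text, NSString *word) {
    // \b anchors the match at word boundaries, so "ass" is not found inside "assume".
    NSString *pattern = [NSString stringWithFormat:@"\\b%@\\b",
                         [NSRegularExpression escapedPatternForString:word]];
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    if (regex == nil) return NO; // pattern failed to compile

    NSRange whole = NSMakeRange(0, [text length]);
    return [regex numberOfMatchesInString:text options:0 range:whole] > 0;
}
```

Word boundaries do not help with deliberate obfuscation (spacing, substituted characters), which is one of the tricky cases a simple filter will miss.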

My company, Inversoft, builds a commercial filtering solution and it is quite intelligent. It doesn't use regex or FSA, but has a custom built fast-linear processing technology that makes it extremely fast and accurate (4,000+ messages per second). It also has over 600 English words in a number of categories including Slang, Racial Slurs, Drug, Gang, Religious, etc.

If you are looking for an intelligent context-aware solution with support, you should check out Clean Speak from Inversoft. Hooking it up to Obj-C should be simple using the XML WebService.

Brian P