I am writing a badword filter in PHP.

I have a list of badwords in an array and the method cleanse_text() is written like this:

public static function cleanse_text($originalstring){
   if (!self::$is_sorted) self::doSort();
   return str_ireplace(self::$badwords, '****', $originalstring);
}

This works for exact matches, but I also want to censor words that have been disguised, like 'ab*d' where 'abcd' is a bad word. This is proving to be a bit more difficult.
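To make the disguised case concrete, here is a rough sketch of one possible approach: build a regular expression per bad word in which every letter after the first may also be one of a few masking characters. The function names `build_disguise_pattern()` and `cleanse_disguised()` and the set of masking characters are assumptions made for illustration, not part of the class above, and the sketch assumes plain ASCII words.

```php
<?php
// Hypothetical helper: turns 'abcd' into a pattern that also matches 'ab*d', 'a#cd', etc.
function build_disguise_pattern(string $word): string
{
    // Assumed set of characters people use as letter stand-ins.
    $maskClass = preg_quote('*@#$!', '/');
    $parts = [];
    foreach (str_split($word) as $i => $ch) {
        $letter = preg_quote($ch, '/');
        // Keep the first letter literal so a bare run of symbols is not matched.
        $parts[] = $i === 0 ? $letter : '[' . $letter . $maskClass . ']';
    }
    // Lookarounds stand in for \b, which does not behave well next to '*'.
    return '/(?<![a-z])' . implode('', $parts) . '(?![a-z])/i';
}

// Hypothetical replacement for cleanse_text() that uses the patterns above.
function cleanse_disguised(string $text, array $badwords): string
{
    $patterns = array_map('build_disguise_pattern', $badwords);
    return preg_replace($patterns, '****', $text);
}

echo cleanse_disguised('This is ab*d and abcd, but abcde stays.', ['abcd']), "\n";
// prints: This is **** and ****, but abcde stays.
```

Every character added to the masking set widens what the filter catches, including text that was never meant as a bad word, so this trades missed matches for false positives.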

Here are my questions:

  1. Is a badword filter worth bothering with? (It is a site for professionals, so I would have thought a certain minimum decorum is required.)

  2. Is it worth the hassle of trying to catch obvious workarounds like 'f*ck', or should I not attempt to filter those out?

  3. Is there a better way of writing the cleanse_text() method above?

+2  A: 

If it's a website for professionals, then don't bother. You won't see much cursing in the first place, and when you do it will most likely be for comedic effect or similar. People who do swear a lot in an immature manner will be punished simply by making a bad impression on everyone. (And those who completely overdo it should be dealt with by moderators anyway, so that shouldn't be an issue.)

What happens when you try to implement a bad word filter is that you end up censoring completely benign uses of swear words, and in many cases you also censor words that are not swear words but are similar enough for the filter to catch. (It's called the Scunthorpe problem, as @deceze mentioned in the comments.) Also, unless you go all-out, it will be really easy to circumvent. All in all, I'd say it's not worth the effort.
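As a small illustration of the Scunthorpe problem, the sketch below compares plain substring replacement with whole-word matching. The word list uses a mild stand-in, and `cleanse_whole_words()` is a hypothetical name, not something from the question.

```php
<?php
$badwords = ['hell'];  // mild stand-in for a real list
$text = 'Hello, open a shell. What the hell?';

// Substring replacement, as in the question's cleanse_text().
echo str_ireplace($badwords, '****', $text), "\n";
// prints: ****o, open a s****. What the ****?

// Hypothetical alternative: only replace matches that stand alone as whole words.
function cleanse_whole_words(string $text, array $badwords): string
{
    if (!$badwords) {
        return $text;
    }
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $badwords);
    return preg_replace('/\b(?:' . implode('|', $quoted) . ')\b/i', '****', $text);
}

echo cleanse_whole_words($text, $badwords), "\n";
// prints: Hello, open a shell. What the ****?
```

Whole-word matching removes this particular class of false positive, but it is also trivially circumvented by adding or changing a single character.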

Take Stack Overflow as an example. It has no bad word filter, and it's doing just fine; I haven't heard of any problems with that kind of thing.

musicfreak
+7  A: 

I definitely wouldn't bother with it.

  1. It's a site for professionals, so you can assume that they will act appropriately. Some moderation and enforcement of rules will put anyone in line. Look at Stack Overflow for example. Even without the community moderation tools, people can be pressured into behaving appropriately.

  2. It would fail. There would be too many false positives ("clbuttic"), and a list containing all possible swear words would be impossible to maintain. Replacing certain letters (e.g. f*ck) makes it no less offensive. Removing the word altogether destroys meaning, which is a huge problem with false positives.

  3. Consider a discussion about donkeys and birds. It'd be all about asses, tits, boobies and cocks.
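Points 2 and 3 can be reproduced in a few lines. This is only a sketch: the euphemism map and the function names `swap_substring()` and `swap_whole_words()` are made up for the example.

```php
<?php
$map = ['ass' => 'butt'];  // hypothetical bad word => replacement map

// Substring replacement: this is where "clbuttic" comes from.
function swap_substring(string $text, array $map): string
{
    return str_ireplace(array_keys($map), array_values($map), $text);
}

// Whole-word replacement: avoids "clbuttic" but cannot tell which sense is meant.
// Assumes replacement strings contain no '$' or '\' characters.
function swap_whole_words(string $text, array $map): string
{
    foreach ($map as $bad => $mild) {
        $text = preg_replace('/\b' . preg_quote($bad, '/') . '\b/i', $mild, $text);
    }
    return $text;
}

echo swap_substring('A classic assassin story', $map), "\n";
// prints: A clbuttic buttbuttin story

echo swap_whole_words('A classic assassin story', $map), "\n";
// prints: A classic assassin story

echo swap_whole_words('The farmer loaded the ass with hay.', $map), "\n";
// prints: The farmer loaded the butt with hay.
```

The last line is the donkey problem: the match is technically correct, yet the sentence was harmless and its meaning is now gone.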

nickf
+1 for #3
Chacha102
"Consider a discussion about donkeys and birds ...", I had to bite my lips HARD to prevent my self laughing out LOUD at this one. I see the point you are making though .. ;)
morpheous
terrible...just terrible. +1 for #3 - terrible or not, hilarious.
dboarman
A: 

Okay, here's a different idea:

I don't know what content you are filtering, but I'll just assume it's a comment system since this will still apply for whatever else it might be.

You probably have some kind of administrative interface. What if, every time someone includes a possible "bad word" in a comment, it leaves a note for you in said interface? Or sends you a daily email of all the maybe-profanity that appeared on your site? There could be links next to each listing that, when clicked, would automatically apply a filter to that comment/post/whatever, or delete it, or whatever you want. Then you could just glance at the report, click once or twice to clean up the site, and be done with it.

You might think this wouldn't scale. It probably wouldn't. But if your site doesn't get a tonne of traffic, you might not even get a report every day. Or every week. You might not have to intervene much at all. No lists, no thinking of every possible objectionable word and all of their possible spellings, no false-positives.

It could work.
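A minimal sketch of what the flagging side might look like, assuming comments are available as an array keyed by ID. Everything here (the function names, the array shape, the sample data) is hypothetical; how the report gets shown in an admin page or emailed is left out.

```php
<?php
// Hypothetical: list which watched words appear in a comment, without altering it.
function find_possible_badwords(string $comment, array $badwords): array
{
    $hits = [];
    foreach ($badwords as $word) {
        if (stripos($comment, $word) !== false) {
            $hits[] = $word;
        }
    }
    return $hits;
}

// Hypothetical: collect flagged comments for a human to review.
function build_review_report(array $comments, array $badwords): array
{
    $report = [];
    foreach ($comments as $id => $comment) {
        $hits = find_possible_badwords($comment, $badwords);
        if ($hits) {
            $report[$id] = ['text' => $comment, 'matched' => $hits];
        }
    }
    return $report;
}

$comments = [
    101 => 'Great article, thanks!',
    102 => 'What the hell is this?',
    103 => 'I wrote a shell script for it.',
];

// Comments 102 and 103 end up in the report; 101 does not.
print_r(build_review_report($comments, ['hell']));
```

Comment 103 is a false positive, but that matters less here: nothing was changed on the site, and a person decides what to do with each entry.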

Carson Myers
This isn't a bad idea, and isn't far off from what I suggested or what he already has implemented. He would still need a filter (like the one I provided) to make this work. Even better than your solution, why not implement the filter as I suggested (replacing or removing the bad word) AND THEN have it email the admin? The filter is right more often than it is wrong, and when it is wrong the admin can correct the situation and change ****ake mushrooms to shitake mushrooms.
typoknig
A: 

It's all 8u11$#1+ anyway. Just post a human-readable rule, let humans flag offensive contributions and ban offenders.

Eric Mickelsen