For an ASP.NET application, what is the best-practice way to implement a custom swear word remover/replacer?

If this is a data table solution, is there a free resource for the data? (Similar to finding a public dictionary table that you can import into your system for spellchecking.)

+15  A: 

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea? ^_^

Also see How do you implement a good profanity filter?.

PhiLho
Just about to post it :) +1
Sunny
That was my last +1 for the day... Six hours to reset. But it was worth it.
Chris Charabaruk
+6  A: 

The only way to win is not to play.

Consider the following sentence:

"Edward II was one of only a handful of monarchs to give birth to a recorded bastard."

"Bastard" is a borderline swear word, but in this context it is a completely sensible term.

Consider also:

  • "The molten slag fell out of the cruciable."
  • "The bitch sniffed the other dog's backside."

You are never going to be able to build a parser that is capable of working out whether the usage is legitimate. Even if you decided to go ahead and just star out those words, the filter is easily subverted.

Ask yourself: is "Tw*t" really that much less offensive than "twat"? Everyone knows what word you're pointing to and everyone understands what it means.
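To make both failure modes concrete, here is a minimal sketch of the naive "star it out" approach, written in Python for brevity. The word list and the `star_out` helper are hypothetical and exist only for illustration; they are not taken from any real filter.

```python
import re

# Hypothetical block list, built from the example words in this answer.
BLOCKED = {"bastard", "slag", "bitch", "twat"}

# A "word" here is simply a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")

def star_out(text):
    """Replace the inner letters of any blocked word with asterisks."""
    def mask(match):
        word = match.group(0)
        if word.lower() in BLOCKED:
            return word[0] + "*" * (len(word) - 2) + word[-1]
        return word
    return WORD.sub(mask, text)

# False positive: a perfectly legitimate sentence about dogs is censored.
print(star_out("The bitch sniffed the other dog's backside."))

# Evasion: a trivially obfuscated spelling passes through untouched.
print(star_out("tw@t"))
```

The first call censors an innocent sentence and the second lets an obvious insult through, which is the whole problem in two lines: the filter sees strings, not intent.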

Ultimately, the solution to this problem is not technological. What you really want is a human moderator of some sort to deal with the people who swear. A human moderator has a faculty that algorithms never will: judgement. Exercising that judgement is far more useful than throwing computer science at the problem.

This is discussed at length in the other answers to this question.

Simon Johnson
Er...what's the bad word in that second example? Slag? Cruciable? Molten?
Kyralessa
@Kyralessa, it's "slag." It's a Brit thing.
Robert S.
+1  A: 

Well, what we (*) did was create a two-tiered list of "bad words" (using regexes to hopefully catch some variations). Using a Tier 1 word gets you a warning saying that you are violating the Terms of Service, and you cannot save the message until you fix it. If you use a Tier 2 word, the message is posted, but an objection is automatically filed against it. All messages with an objection flagged (either system- or user-generated) are reviewed by a human, who determines whether they stay or go.

(*) "We" being the e-commerce arm of a large, staid brick-and-mortar chain store, which has just started allowing user-generated content on its website.
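A rough sketch of how such a two-tier check might look, in Python for brevity (the same idea should carry over to .NET regexes). This is not the actual implementation described above: the patterns are harmless placeholders, and `Verdict` and `classify` are hypothetical names.

```python
import re
from enum import Enum

class Verdict(Enum):
    REJECT = "reject"  # Tier 1 hit: refuse to save until the author edits the message
    FLAG = "flag"      # Tier 2 hit: post it, but file an automatic objection for human review
    ACCEPT = "accept"  # no hit: post normally

# Placeholder patterns (assumptions); a real list would be curated by the site owner.
TIER1 = [re.compile(r"\bbadword\w*", re.IGNORECASE)]
TIER2 = [re.compile(r"\bd[a@]rn\w*", re.IGNORECASE)]  # the character class catches one simple variation

def classify(message):
    """Return the verdict for a message, checking Tier 1 before Tier 2."""
    if any(p.search(message) for p in TIER1):
        return Verdict.REJECT
    if any(p.search(message) for p in TIER2):
        return Verdict.FLAG
    return Verdict.ACCEPT

print(classify("Well, d@rn it."))        # Tier 2 pattern matches
print(classify("Lovely product, thanks"))  # nothing matches
```

Only the first decision is automated here. A `FLAG` verdict would still land in the same human review queue as user-filed objections, so the final call stays with a person.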

James Curran
Seems like a sensible way of doing it to me. There are some words that, in any context, are just plain "bad", and others that depend on context, as has been pointed out.
Evan
+1  A: 

This is why REAL programming languages have a request.getUserIntent() method.

if(request.getUserIntent() == Intent.INSULTING) {
    rejectInput();
}
Adam Jaskiewicz
True, but only interpreted ones... ;)
ChrisA