ansaurus

Question

How do you implement a good profanity filter?

Answer 1

+60 A:

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with: "I want to stick my long-necked Giraffe up your fluffy white bunny."

Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-bys, but for the determined troll, you absolutely must have a non-algorithm-based approach.

A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.

You also asked where you can get profanity lists to get you started -- one open-source project to check out is DansGuardian -- check out the source code for their default profanity lists. There is also an additional third-party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.

Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and simply do a regex find/replace with it. Take a regex like:

$filterRegex = "/(boogers|snot|poop|shucks|argh)/";

and run it on your input string using preg_match() to test wholesale for a hit, or preg_replace() to blank the matches out.

You can also pass those functions arrays rather than a single long regex, which may be more manageable for long word lists. See the preg_replace() documentation for some good examples of how arrays can be used flexibly.
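A minimal sketch of both forms, assuming PHP 7.4+ (for arrow functions) and a placeholder word list. The helper names buildFilterRegex(), containsBannedWord() and censorText() are made up for illustration; only preg_match(), preg_replace() and preg_quote() are real PHP functions. Note that these patterns match substrings, so "snotty" would be hit as well.

```php
<?php
// Hypothetical placeholder list; swap in your own.
$bannedWords = ['boogers', 'snot', 'poop', 'shucks', 'argh'];

// Form 1: one long alternation regex built from the list.
// preg_quote() escapes regex metacharacters so each entry is matched literally.
function buildFilterRegex(array $words): string
{
    $escaped = array_map(fn($w) => preg_quote($w, '/'), $words);
    return '/(' . implode('|', $escaped) . ')/i';
}

// Wholesale test: does the text contain any banned word?
function containsBannedWord(string $text, array $words): bool
{
    return preg_match(buildFilterRegex($words), $text) === 1;
}

// Form 2: an array of patterns, one per word, handed to preg_replace().
function censorText(string $text, array $words): string
{
    $patterns = array_map(fn($w) => '/' . preg_quote($w, '/') . '/i', $words);
    return preg_replace($patterns, '***', $text);
}

var_dump(containsBannedWord('Oh snot!', $bannedWords));    // bool(true)
echo censorText('Argh, snot everywhere', $bannedWords), "\n"; // ***, *** everywhere
```

The array form is easier to maintain when the list is loaded from a file or database, since each entry stays a separate pattern.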

For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
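The linked class is not reproduced here. As a rough illustration of the "star out the center letters" idea only, here is a sketch using preg_replace_callback(). The function name starOutCenter() is hypothetical, the \b word boundaries are an added assumption, and the code treats text as single-byte, so it is not safe for multibyte UTF-8 words as written.

```php
<?php
// Keeps the first and last letter of each matched word and masks the middle.
function starOutCenter(string $text, array $words): string
{
    $escaped = array_map(fn($w) => preg_quote($w, '/'), $words);
    $regex = '/\b(' . implode('|', $escaped) . ')\b/i';

    return preg_replace_callback($regex, function (array $m): string {
        $word = $m[0];
        $len = strlen($word);
        if ($len <= 2) {
            // Nothing that counts as a "center": mask the whole word.
            return str_repeat('*', $len);
        }
        return $word[0] . str_repeat('*', $len - 2) . $word[$len - 1];
    }, $text);
}

echo starOutCenter('Shucks, that is snot nice', ['shucks', 'snot']), "\n";
// S****s, that is s**t nice
```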

You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous DansGuardian links, you may find this handy .zip of 458 words to be helpful.

HanClinto 2008-11-07 20:21:12

Lol, I was about to point to there! ;-)

PhiLho 2008-11-07 20:21:47

+1 just for using 'fluffy bunny'!

Mitch Wheat 2008-11-08 01:36:44

@JPLemme: Yes it should -- I should have added [sic] afterwards, since that's how Atwood spelled it. :)

HanClinto 2008-11-11 18:37:30

"Club Penguin" adds hundreds of entries to their profanity filter *every day*: http://www.raphkoster.com/2008/05/09/club-penguin-adds-1000-words-a-day-to-their-filter/

Frank Farmer 2009-06-20 00:02:58

A word boundary wrapper around your regex options would prevent the **clbuttic** mistake

ck 2010-04-27 13:03:19

@ck: Only if you're not worried about being able to filter out misspelled words like "F*ckkkk yo' asssss" :) I'm not sure I trust my trolls to have very precise spelling.

HanClinto 2010-04-27 20:46:33

Answer 2

+15 A:

Don't.

Because:

Clbuttic
Profanity is not OMG EVIL
Profanity cannot be effectively defined
Most people quite probably don't appreciate being "protected" from profanity

Edit: While I agree with the commenter who said "censorship is wrong", that is not the nature of this answer.

eyelidlessness 2008-11-07 20:22:31

This is the only acceptable answer.

jonnii 2008-11-07 20:23:53

10 upvotes for this non-answer? As if anybody who wants to filter profanity must be a moralizing half wit? Good grief. This is a valid question and snarky drive-by responses shouldn't be rewarded. -1.

Kluge 2008-11-07 21:45:17

what constitutes a 'swear' is debatable. Censorship in any form is bad.

Mitch Wheat 2008-11-08 01:38:01

@Kluge: You're the only one who said "moralizing half wit"; in fact, I said nothing about the moral nature of implementing a profanity filter at all. Mitch brings up part of the reason I said "don't", and it's not a snarky drive-by. Sometimes "don't" is the correct answer to "how do I...?" [cont'd]

eyelidlessness 2008-11-08 01:57:22

I happen to think this is one of those times. That you disagree is fine, but I don't think you should read too much into it. And if you want to ask *why* I think it shouldn't be done, I'd be happy to clarify the answer.

eyelidlessness 2008-11-08 01:58:45

@eyelidlessness: Perhaps you are right that I read too much into your single-word answer. But since you didn't elaborate, I couldn't tell if your objections were on moral grounds or technical ones. I'll admit that I'm tired of "censorship in any form is bad" comments.

Kluge 2008-11-08 17:09:13

Why are you tired of them? Maybe you disagree, but it's a perfectly legitimate perspective. (Note: that is *not* the nature of my answer.)

eyelidlessness 2008-11-09 17:21:08

+1 due to Kluge's response.

Joel Mueller 2009-06-25 20:55:17

Answer 3

A:

Hell if I know. :)

On a more serious note, I don't think you can do it reliably. A friend of mine is named Titsworth and he said profanity filters at his high school always displayed his name as T***worth.

I think there will always be too many edge cases to do it 100%.

Josh Bush 2008-11-07 20:23:15

Answer 4

A:

Frankly, I'd let users get their "trick the system" words out and then ban the users instead, but that's just me. It also makes the programming simpler.

What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or, if the word is a prefix of others, /[\s]doob(er|ed|est)[\s]/. These would avoid filtering words like assuaged, which is perfectly valid, but they also require knowing the other variants and updating the filter whenever you learn a new one. Obviously these are just examples; you'd have to decide how to do it yourself.
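One caveat with the [\s] form: it needs literal whitespace on both sides, so it misses a word at the start or end of the string or next to punctuation, and a replacement swallows the surrounding spaces. A hedged comparison with PCRE's \b word-boundary assertion follows, using the same placeholder terms. As noted in the comments on Answer 1, neither form catches stretched or misspelled variants.

```php
<?php
// "doob" and its suffixes are the placeholder terms from the answer above.
$whitespaceRegex = '/[\s]doob(er|ed|est)[\s]/i'; // needs whitespace on both sides
$boundaryRegex   = '/\bdoob(er|ed|est)\b/i';     // \b also matches at punctuation and string edges

// The whitespace version consumes the spaces and misses edge positions.
echo preg_replace($whitespaceRegex, '***', 'a doober b'), "\n"; // a***b
echo preg_replace($whitespaceRegex, '***', 'doober!'), "\n";    // doober! (no match)

// The boundary version keeps the spacing and still skips longer words.
echo preg_replace($boundaryRegex, '***', 'a doober b'), "\n";   // a *** b
echo preg_replace($boundaryRegex, '***', 'doober!'), "\n";      // ***!
echo preg_replace($boundaryRegex, '***', 'doobery'), "\n";      // doobery (no match)
```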

I'm not about to type out all the words I know, not when I don't actually want to know them.

The Wicked Flea 2008-11-07 20:25:28

Answer 5

+8 A:

I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!

Matt Passell 2008-11-07 20:26:00

or forbidding "cockpit" in a flying spaceships game

Shinhan 2008-11-07 21:28:31

Answer 6

+7 A:

Have a look at CDYNE's Profanity Filter Web Service

Testing URL

Tim Cavanaugh 2008-11-07 20:27:24

Answer 7

A:

Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.

Adam Jaskiewicz 2008-11-07 20:37:46

Answer 8

+5 A:

The only way to prevent offensive user input is to prevent all user input.

If you insist on allowing user input and need moderation, then incorporate human moderators.

Axel 2008-11-07 20:42:39

Answer 9

+2 A:

If you can do something like Digg/Stack Overflow, where the users can downvote/mark obscene content... do so.

Then all you need to do is review the "naughty" users, and block them if they break the rules.

scunliffe 2008-11-07 20:46:59

Answer 10

+2 A:

OT : I got lots of childish pleasure from searching for a bad word list a year or two ago. That is the only positive thing I gained from my profanity filtering experiences.

Please bear in mind when you read this post that I live about 20 miles from Scunthorpe.

ZombieSheep 2008-11-07 21:27:42

Answer 11

+5 A:

a profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments

that said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding which is pretty much intractable with current technology

so, the only practical solution is twofold:

be prepared to update your dictionary frequently
hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)

Steven A. Lowe 2008-11-07 22:27:08

Answer 12

+10 A:

During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?

Of course, the most foul word in the English language.

Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
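The original script isn't shown. A sketch of the same idea in PHP (to match the rest of this thread) might look like the following, assuming both lists are already loaded as arrays of words. The function name removeProfanity() is hypothetical.

```php
<?php
// Returns $dictionary minus any entry found in $profanityList (case-insensitive).
function removeProfanity(array $dictionary, array $profanityList): array
{
    // Flip into a lookup table keyed by lowercase word for fast membership tests.
    $banned = array_flip(array_map('strtolower', $profanityList));

    return array_values(array_filter(
        $dictionary,
        fn(string $word): bool => !isset($banned[strtolower($word)])
    ));
}

print_r(removeProfanity(['apple', 'Snot', 'pear'], ['snot']));
// Array ( [0] => apple [1] => pear )
```

Because the comparison is whole-word, this never has to deal with substrings, which is why it works without anyone reading the list.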

For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.

Matthew 2008-11-07 22:36:23

Answer 13

+5 A:

Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search, e.g., use a series of regexes (or tr if PHP has it) to convert [z$5] to "s", [4@] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
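PHP's closest equivalent to tr is strtr(), which accepts a substitution table. A minimal sketch of the normalize-then-compare approach follows, using only the two substitutions named above. The function names normalize() and containsNormalized() are hypothetical, and the table would need extending for real use.

```php
<?php
// Lowercases the text and maps look-alike characters to the letters they imitate.
function normalize(string $text): string
{
    // Substitutions from the answer: [z$5] -> s, [4@] -> a.
    static $map = ['z' => 's', '$' => 's', '5' => 's', '4' => 'a', '@' => 'a'];
    return strtr(strtolower($text), $map);
}

// Normalizes both sides, then does a plain substring search.
function containsNormalized(string $text, array $badWords): bool
{
    $haystack = normalize($text);
    foreach ($badWords as $word) {
        if (strpos($haystack, normalize($word)) !== false) {
            return true;
        }
    }
    return false;
}

echo normalize('A$$'), "\n";                        // ass
var_dump(containsNormalized('you 4$$', ['ass']));   // bool(true)
```

The normalized string is used only for matching; the user's original text is left untouched, so legitimate digits and symbols are not rewritten in what gets displayed.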

The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".

Dave Sherohman 2008-11-08 01:35:13

Answer 14

+1 A:

I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. This is a vain effort because, as you originally mentioned, you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.

On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.

Thank you for the ideas.

HanClinto rules!

2009-02-24 20:30:20

Answer 15

+2 A:

I have had a lot of success using WebPurify (www.webpurify.com). Trying to write your own profanity filter can drive you crazy; these guys seem to have it figured out.

2009-05-29 02:50:08

Answer 16

+5 A:

Webpurify.com is an API that will handle those profanity filtering needs. I was referred by a programmer buddy and have been using their service for a few years now and am very pleased. Users can create "white lists" and "black lists". The service offers a few different filtering options, like providing a profanity count or simply replacing profanity with symbols.

Josh B 2010-02-18 15:07:38

Answer 17

+3 A:

If you want a full solution with a Java filter, WebService, and a web application to manage everything, Inversoft sells a product called Clean Speak. It is extremely fast and accurate. It handles replacement characters (a$$) and spaces (a s s). Many other filters seem to break easily with spaces and other punctuation.

You can deploy the software on your servers and integration is pretty simple. They also offer a number of additional tools to help manage user-generated content, like a moderation system and a monitoring system.

Brian P 2010-03-08 06:03:37

Answer 18

+2 A:

Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.

One current example of this: eBay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the German translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), eBay will reject the feedback due to bad words.

Why? Because the German word for "was" is "war", and "war" is in eBay's dictionary of "bad words".

So beware of localization issues.

Sam 2010-04-27 12:55:55

Answer 19

A:

"heading over the bridge to Hancock for a couple hours"

that's a great new euphemism!

Anentropic 2010-07-01 13:59:05

sorry I would have just made this a comment, but Stack Overflow won't let me comment yet, except on my own answers apparently.

Anentropic 2010-09-20 16:19:39

Answer 20

A:

I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:

Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.

Also see this blog post for more details:

Fast Multiple String Replacement in PHP

With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
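To make the trie idea concrete, here is a pure-PHP sketch. It is not Boxwood's code or API, and the names buildTrie() and trieReplace() are invented for illustration. It is byte-oriented, ASCII case-insensitive, and has no word-boundary logic. Unlike the single-pass scan described above, this simplified version restarts the trie walk at each text position, so its cost grows with the longest term's length but still not with the number of terms.

```php
<?php
// Builds a nested-array trie; each edge is one lowercase byte.
function buildTrie(array $terms): array
{
    $root = [];
    foreach ($terms as $term) {
        if ($term === '') {
            continue; // an empty term would match everywhere
        }
        $node = &$root;
        foreach (str_split(strtolower($term)) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['end'] = true; // multi-byte key, cannot collide with a one-byte edge
        unset($node);        // break the reference before the next term
    }
    return $root;
}

// Masks every occurrence of a trie term, preferring the longest match.
function trieReplace(string $text, array $trie, string $mask = '*'): string
{
    $lower = strtolower($text);
    $len = strlen($text);
    $out = '';
    $i = 0;
    while ($i < $len) {
        // Walk the trie from position $i, remembering the longest term ending here.
        $node = $trie;
        $matchLen = 0;
        for ($j = $i; $j < $len && isset($node[$lower[$j]]); $j++) {
            $node = $node[$lower[$j]];
            if (isset($node['end'])) {
                $matchLen = $j - $i + 1;
            }
        }
        if ($matchLen > 0) {
            $out .= str_repeat($mask, $matchLen);
            $i += $matchLen;
        } else {
            $out .= $text[$i];
            $i++;
        }
    }
    return $out;
}

$trie = buildTrie(['snot', 'poop']);
echo trieReplace('Snot and POOP here', $trie), "\n"; // **** and **** here
```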

Gordon 2010-09-30 09:01:18


Edit: Response to answers that say simply avoid the programmatic issue:
