views:

1651

answers:

10

Here's a clbuttic question. I collect code that attempts to do profanity filtering. I personally like profanity, and whenever possible try to talk everyone out of using profanity filters. The filters always run into the very embarrassing Scunthorpe Problem, which tends to make things worse. Of course there are sites that legitimately need profanity filters - mostly sites for children, which have to answer to certain governmental guidelines, etc.

I just find reading profanity-detecting regexes and lists absolutely hilarious. I'll put my collection online once I get it ready, but meanwhile, do you have some profanity-related code/dictionary/tool/whatever to share? How about a profanity generator?

If you have a good long sample, please pastebin it here; if it's short, just put it in comments.

+13  A: 

the best profanity generators i know of are in the U.S. Navy. They're called "sailors".

Steven A. Lowe
We used to sit around making up German words that sounded like profanity, then making up the matching definition.
EBGreen
a friend of mine, a sailor on his first leave, dropped an ashtray on the floor, and released a stream of invective that was blistering, inventive, physically impossible, blasphemous, and nearly poetic, all at once. I cannot imagine what he might have said if the ashtray had landed on his foot
Steven A. Lowe
I really hope the OP picks a different answer... No offense Steven - but this answer isn't worth 150 extra rep...
Erik Forbes
@[Erik]: I would be astonished if this was picked as the answer, but then again i'm also astonished that this question hasn't been closed.
Steven A. Lowe
I can't believe you got the 75pts! hahahaha.
Keng
+1  A: 

When we last did this we generated a reg-ex on the fly from a database of offensive words, but it became apparent that we needed to support number-letter substitutions, e.g. sh1t or ar5e. Turned out to be quite a big reg-ex!
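For anyone wanting a starting point, here is a minimal sketch of that idea in Python. It is not Macka's code: the `SUBSTITUTIONS` table, the function names, and the placeholder word list are all assumptions for illustration, and a real list would come from a database.

```python
import re

# Hypothetical substitution table: each letter maps to the characters
# that commonly stand in for it. Extend to taste.
SUBSTITUTIONS = {
    "a": "a4@",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s5$",
    "t": "t7",
}


def word_to_pattern(word):
    """Turn one word into a regex fragment that also matches its substitutions."""
    parts = []
    for ch in word.lower():
        chars = SUBSTITUTIONS.get(ch)
        if chars:
            # A character class covering the letter and its stand-ins.
            parts.append("[" + re.escape(chars) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def build_filter(words):
    """Compile one big alternation from the whole word list."""
    alternatives = "|".join(word_to_pattern(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


# Placeholder words standing in for rows loaded from the database.
BANNED = ["darn", "heck"]
FILTER = build_filter(BANNED)
```

`FILTER.search("what the h3ck")` finds a match while `FILTER.search("checkers")` does not. The pattern grows with every word and every substitution, which is presumably why the original ended up so big.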

However, rather than source an existing list of offensive words I just told the studio I was looking for some suggestions - it didn't take long! :)

Macka
do you have any code samples?
deadprogrammer
+2  A: 

Dansguardian does a pretty good job, as long as you properly edit the weighted phrases. They leave out some whitespace, so you get false positives when you visit a clothing store in Virginia. And it always trips when you visit a store with e**x**tra e**x**tra e**x**tra large clothing sizes.

The url regex that comes with it is totally worthless and when I set it up I always disable that.

http://usmirror.dansguardian.org/downloads/2/Stable/dansguardian-2.10.0.1.tar.gz

Georg Zimmer
A: 

The best advice I have (of course you already know it) is that word boundaries are your friends! ("\b")
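A small sketch of the difference, in Python. The one-word list and the `censor` helper are made up for the example; the point is only what `\b` changes.

```python
import re

BANNED = ["ass"]  # placeholder list for the demonstration

alternation = "|".join(re.escape(w) for w in BANNED)

# Matches the word anywhere, even inside longer words.
naive = re.compile(alternation, re.IGNORECASE)

# Matches only when the word stands alone between word boundaries.
bounded = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def censor(pattern, text):
    """Replace every match with asterisks of the same length."""
    return pattern.sub(lambda m: "*" * len(m.group()), text)
```

With the naive pattern, `censor(naive, "a classic mistake")` mangles the innocent word into `cl***ic`; with the bounded pattern the sentence comes back untouched. Word boundaries do not help with compound spellings or deliberate spacing tricks, so this reduces false positives rather than solving the problem.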

80)

PS: I added the clbuttic tag in my RSS reader....looking forward to seeing it populate.

Keng
+8  A: 

The Perl module Regexp::Common::profanity will be of interest to you. It even ROT-13s the actual terms in the regexes, to spare one's feelings when reading the source: shpxvat considerate of them.

The module Regexp::Profanity::US has the added feature of attempting to determine a degree of profanity.
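The ROT-13 trick is easy to borrow in other languages. Below is a rough Python sketch of the general idea, not a port of the Perl module: the encoded placeholder words and the helper name are assumptions.

```python
import codecs

# Word list stored ROT-13 encoded so the source stays readable in polite company.
# These are placeholders: "qnea" and "urpx" decode to "darn" and "heck".
ENCODED_WORDS = ["qnea", "urpx"]


def load_words(encoded):
    """Decode the list at load time; ROT-13 is its own inverse."""
    return [codecs.decode(w, "rot_13") for w in encoded]


WORDS = load_words(ENCODED_WORDS)
```

Because ROT-13 is symmetric, the same codec call both encodes new entries and decodes the stored ones.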

AmbroseChapel
+1  A: 

I would also look at handling leet speak, as Macka mentioned. The idea would be to take your list and do a leet substitution for all the vowels (i.e. e->3, i->1, etc.), thereby creating a whole separate list of banned words.

However, I wouldn't use regex to do the extraction/substitution; it wouldn't be efficient enough for such a broad dictionary search/replace.
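Something along these lines, sketched in Python with no regex involved. The `LEET` table, the placeholder words, and the punctuation stripped from tokens are all assumptions; set lookups stand in for whatever dictionary search you actually use.

```python
from itertools import product

# Hypothetical vowel substitutions; add more as needed.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0"}


def leet_variants(word):
    """Return every spelling of word with each vowel either kept or leet-swapped."""
    options = [(ch, LEET[ch]) if ch in LEET else (ch,) for ch in word.lower()]
    return {"".join(combo) for combo in product(*options)}


def build_banned_set(words):
    """Expand the base list into the separate, larger list of variants."""
    banned = set()
    for word in words:
        banned |= leet_variants(word)
    return banned


BANNED = build_banned_set(["heck", "darn"])  # placeholder base list


def contains_banned(text):
    """Split on whitespace, strip common punctuation, and look each token up."""
    tokens = (t.strip(".,!?;:\"'()").lower() for t in text.split())
    return any(t in BANNED for t in tokens)
```

The variant count doubles with every substitutable vowel, so long words blow the list up quickly. Each lookup stays a constant-time set membership test, though.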

Keng
+1  A: 

The only application I've had for anything like this is trying not to accidentally generate offensive words when creating codes that are used for redeeming things (like money deposited in a print account) or initial passwords. I got around it by removing all vowels from the potential alphabet. With this solution, the worst that will happen is that a person might complain that a string of letters suggests a profane word, but they won't actually see one (at least any English ones).
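Roughly like this, as a Python sketch. The code length, the uppercase-only alphabet, and the function name are assumptions for illustration, not what I actually ran.

```python
import secrets
import string

VOWELS = set("AEIOU")

# Uppercase letters with the vowels removed, so no English word can be spelled out.
ALPHABET = "".join(c for c in string.ascii_uppercase if c not in VOWELS)


def make_code(length=8):
    """Generate a random redemption code or initial password with no vowels."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

`secrets` is used instead of `random` because these codes guard money or accounts. Dropping five letters shrinks the alphabet from 26 to 21 characters, so you may want slightly longer codes to keep the same number of combinations.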

tvanfosson
I had a similar requirement once and solved it by using only digits for customer codes (more than three digits so no one got "666").
Dour High Arch
+2  A: 

Can we rename it from the "Scunthorpe" problem to the "expertsexchange" problem?

Jason Punyon
Is that what people are using to replace the c-word with these days?
Alan Moore
+1  A: 

You could do this with Markov chains: build a large database of profanity-intense insults and use them to generate the table of transition probabilities, then curse like a sailor to your heart's content. Jeff had a pretty good post about this technique a while back.
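A bare-bones word-level sketch in Python, assuming you already have a corpus as one string. The function names and the `order` and `seed` parameters are my own choices for the example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the list of words seen right after it.

    Repeated followers stay in the list, so picking uniformly from it
    reproduces the observed transition probabilities.
    """
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain


def generate(chain, length=12, seed=None):
    """Walk the chain from a random starting state for up to `length` words."""
    rng = random.Random(seed)
    key = rng.choice(list(chain))
    out = list(key)
    for _ in range(length - len(key)):
        followers = chain.get(key)
        if not followers:
            break  # dead end: this state was only seen at the end of the corpus
        out.append(rng.choice(followers))
        key = tuple(out[-len(key):])
    return " ".join(out)
```

Usage would be something like `generate(build_chain(corpus, order=2), length=20)`. A higher `order` gives more coherent output but needs a much larger corpus before it stops parroting the input.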

Kevin Loney
Sure, Markov chains are great for generating pseudo gibberish, but the question is more about where to get the sample text to feed it.
deadprogrammer
It depends on the length of the chain used to generate the transition table; with a suitable length they can perform reasonably well. And you did ask about profanity generation.
Kevin Loney
A: 

My advice is to use a web service like WebPurify (http://www.webpurify.com). In my experience it's very accurate.

jfreger