tags:
views: 400
answers: 10

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?

If it makes it easier, I don't want to filter anything out, just to get a count. So if there are some false positives, it's not the end of the world, as long as the rate is exaggerated more or less uniformly.

+2  A: 

I'd make a huge list.

Regexes have a tendency to misfire when applied to natural language, especially given the number of exceptions English has.

EFraim
Huge lists of regular expressions might be better. Suppose "nose" were a bad word - I could write it "n ose" or "n0se" or "noze" or "n0z3" or whatever, and you'd want to find it. Having character classes like [0oO] and [sSzZ] will make the list much easier to maintain.
David Thornley
@David: and much more prone to errors. I've seen forums block legitimate words just because the authors had not realized that their regex actually captured something legitimate - and that was in Russian, where such a thing is much less likely than in English (fewer words differing by a single letter, say).
EFraim
A regular expression is just another way of implementing a trie, which is just a more compact way of implementing a word list. You're going to have to balance the Scunthorpe problem against character replacement attacks no matter what algorithm you use.
Ken Bloom
And there's no perfect solution. For example, you can buy nice writing implements from www.penisland.com, but God forbid you try to google "Pen Island"...
Ken Bloom
A: 

It depends on what your text source is, but I'd go for some kind of established and proven pattern-matching algorithm, using a trie for example.
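
Something like this minimal trie sketch (the class and method names are my own placeholders, and it assumes tokens are already lower-cased):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal trie over a word list; contains() matches whole tokens only.
    class WordTrie {
        private final Map<Character, WordTrie> children = new HashMap<>();
        private boolean terminal; // a listed word ends at this node

        void add(String word) {
            WordTrie node = this;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new WordTrie());
            }
            node.terminal = true;
        }

        boolean contains(String token) {
            WordTrie node = this;
            for (char c : token.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return false;
            }
            return node.terminal;
        }
    }

Load your word list into the trie once, then count the tokens for which contains() returns true.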

flesh
+2  A: 

Note that any NLP logic like this will be subject to "character replacement" attacks:

For example, I can write "hello" as "he11o", replacing the letter L with the digit one. The same goes for obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up).
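
As a rough sketch of that idea (the substitution map and word list below are placeholders, not a serious blacklist):

    import java.util.Map;
    import java.util.Set;

    class ObscenityCounter {
        // Placeholder digit/symbol-to-letter substitutions and word list.
        private static final Map<Character, Character> SUBS =
                Map.of('1', 'l', '0', 'o', '3', 'e', '@', 'a', '$', 's');
        private static final Set<String> BLACKLIST = Set.of("badword", "worseword");

        static String normalize(String token) {
            StringBuilder sb = new StringBuilder(token.length());
            for (char c : token.toLowerCase().toCharArray()) {
                sb.append(SUBS.getOrDefault(c, c));
            }
            return sb.toString();
        }

        static long count(Iterable<String> tokens) {
            long hits = 0;
            for (String t : tokens) {
                if (BLACKLIST.contains(normalize(t))) hits++;
            }
            return hits;
        }
    }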

Alex
A: 

Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
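
Roughly like this; the Lemmatizer interface below is hypothetical and just stands in for whichever WordNet binding (JWNL, JWI, etc.) you end up using:

    import java.util.Set;

    // Hypothetical interface - adapt it to your WordNet library's morphy wrapper.
    interface Lemmatizer {
        String lemma(String word, String posTag);
    }

    class LemmaChecker {
        private final Lemmatizer lemmatizer;
        private final Set<String> obsceneLemmas; // base forms only

        LemmaChecker(Lemmatizer lemmatizer, Set<String> obsceneLemmas) {
            this.lemmatizer = lemmatizer;
            this.obsceneLemmas = obsceneLemmas;
        }

        boolean isObscene(String word, String posTag) {
            return obsceneLemmas.contains(lemmatizer.lemma(word.toLowerCase(), posTag));
        }
    }

The part-of-speech tags you already have feed straight into the lemma lookup, so the word list only needs base forms.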

Ken Bloom
+5  A: 

A huge list, and think of the target audience. Is there a third-party service that specialises in this that you could use, rather than rolling your own?

Some quick thoughts:

  • The Scunthorpe problem (and follow the links to "Swear filter" for more)
  • British or American English? fanny, fag, etc.
  • Political correctness: "black" or "Afro-American"?

gbn
+1  A: 

One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
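
A quick illustration of that trade-off in Java, using a mild word as the listed term:

    import java.util.regex.Pattern;

    class WholeWordDemo {
        public static void main(String[] args) {
            Pattern substring = Pattern.compile("ass", Pattern.CASE_INSENSITIVE);
            Pattern wholeWord = Pattern.compile("\\bass\\b", Pattern.CASE_INSENSITIVE);

            System.out.println(substring.matcher("classical music").find()); // true  - false positive
            System.out.println(wholeWord.matcher("classical music").find()); // false - innocent text passes
            System.out.println(wholeWord.matcher("you assclown").find());    // false - merged word slips through
        }
    }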

Mark Thornton
A: 

I would advocate a large list of simple regexes: smaller than a list of all the variants, but not trying to capture anything more than letter alternatives in any given expression, like "f[u_@#$%^&*.-]ck".
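
Something like this minimal counter (the patterns are placeholders; note that the hyphen needs to sit at the end of the character class, or be escaped, or Java's regex engine will reject it as a malformed range):

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class RegexListCounter {
        // One pattern per root, capturing only letter/symbol alternatives.
        private static final List<Pattern> PATTERNS = List.of(
                Pattern.compile("\\bf[u_@#$%^&*.-]ck\\w*", Pattern.CASE_INSENSITIVE),
                Pattern.compile("\\bsh[i1!]t\\w*", Pattern.CASE_INSENSITIVE)
        );

        static long count(String text) {
            long hits = 0;
            for (Pattern p : PATTERNS) {
                Matcher m = p.matcher(text);
                while (m.find()) hits++;
            }
            return hits;
        }
    }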

Software Monkey
A: 

You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.

Here are a couple of useful links:

A Plan For Spam - Paul Graham's influential essay on using Bayesian analysis to combat spam.

Data Mining (ppt) - This was written by a colleague of mine.

Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
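
A minimal sketch of the scoring step, assuming per-word log-likelihood ratios have already been estimated from labelled training text (the training step is what a library like Classifier4J would handle for you):

    import java.util.Map;

    class BayesScorer {
        // log P(word | offensive text) - log P(word | clean text), from training data.
        private final Map<String, Double> logLikelihoodRatio;
        private final double priorLogOdds;

        BayesScorer(Map<String, Double> logLikelihoodRatio, double priorLogOdds) {
            this.logLikelihoodRatio = logLikelihoodRatio;
            this.priorLogOdds = priorLogOdds;
        }

        // Positive result => the text looks more like the offensive class than the clean one.
        double score(String[] words) {
            double logOdds = priorLogOdds;
            for (String w : words) {
                logOdds += logLikelihoodRatio.getOrDefault(w.toLowerCase(), 0.0);
            }
            return logOdds;
        }
    }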

Caleb Powell
Naive Bayesian text classifiers don't really help with the problem of finding individual words in a document. Filtering obscenities out of a document (when presumably you want to keep the rest of the document) is very different from detecting whether the email as a whole has valuable content. Moreover, a lot of stuff will get through if you try to filter out swear words by blocking the whole email based on content.
Ken Bloom
The question states that there is no intention to 'filter out' anything; only to count the number of obscenities in the text. One problem raised is that a word may or may not be an obscenity depending on its lexical context. Bayesian probability would be useful in determining whether a sentence (or a larger piece of text) contains profanity. Sentences found likely to contain profanity could then be fed into a separate algorithm responsible for counting each obscenity (so it is used as a filter, but only to locate text that has a high probability of being offensive).
Caleb Powell
I think lexical context is a lot less of an issue than you make it out to be, and I would only add Bayesian classification to deal with context if a purely word-list-based approach didn't work well enough. Even so, I wouldn't do it on sentences; rather, I'd do it on a 5 to 10 word window around the lexeme in question.
Ken Bloom
+2  A: 

Is the phrase "I want to stick my long-necked giraffe up your fluffy white bunny" obscene?

Pete Kirkham
How did I know you're a Brit before I looked at your profile... ;-)
gbn
A: 

There are web services that do this kind of thing in English.

I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.

Owen Blacker