views: 439
answers: 6
Like http://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.

(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)

+1  A: 

Profanity? What the f***'s that? ;-)

It will still take a couple of years before a computer can really recognize swearing and cursing, and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."

Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:

"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."

(that was from Winston Churchill, btw.)

Aaron Digulla
Exactly. Profanity filters are pointless, at least until natural language parsers are much better.
delnan
Who downvotes this, and why?
delnan
@delnan: I guess because I asked what a good profanity filter library was, not whether I should use one at all. Suggestions like this can be better as comments, although they can be valid as answers too.
Paul D. Waite
@Aaron: yeah, I’m not planning to have the machine deal with profanity on its own. But rather than making a human being look at every damn thing on the site, it’d be nice if the machine could offer suggestions of what’s worth taking a look at. (That’s not a criticism of your answer, as I didn’t provide any explanation of what I was going to use the filter for.)
Paul D. Waite
@Aaron: oh, and I reckon it’ll be a lot longer than a couple of years before computers reliably understand English. And that the subset of people who care about the swears will not have gone away.
Paul D. Waite
I downvote this - I categorically disagree with the concept of profanity being neutral.
Paul Nathan
Personally, as a moderator, I'd let that one through on account of sheer quality.
intuited
+15  A: 

I didn't find any Python profanity library, so I made one myself.

Parameters


filterlist

A list of regular expressions that match forbidden words. Do not include \b; it is inserted automatically depending on inside_words.

Example: [r'bad', r'un\w+']

ignore_case

Default: True

Self-explanatory.

replacements

Default: "$@%-?!"

A string of characters from which the replacement strings are randomly generated.

Examples: "%&$?!" or "-" etc.

complete

Default: True

Controls whether the entire word is replaced, or whether its first and last characters are kept.

inside_words

Default: False

Controls whether words are also matched inside other words. When disabled, \b word boundaries are added around the pattern so that only whole words match.

Module source


(examples at the end)

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        else:
            return value[0]+self._make_clean_word(len(value)-2)+value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    f = ProfanitiesFilter([r'bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."

    print(f.clean(example))
    # Prints "I am doing --- ------ badlike things."

    f.inside_words = True
    print(f.clean(example))
    # Prints "I am doing --- ------ ---like things."

    f.complete = False
    print(f.clean(example))
    # Prints "I am doing b-d u----d b-dlike things."
leoluk
Now that’s an answer.
Paul D. Waite
Profanity isn't primarily about words, but usage; most words which can be used as "profanity" have perfectly "clean" uses, and it takes a lot more than a regex to distinguish them. (Never mind, of course, that anything like this will only prompt people to w*rk ar*und it.)
Glenn Maynard
(I think it's pretty neat that just putting asterisks in w*rds makes it look l*ke a *wear.)
Glenn Maynard
@Glenn: yes, we know. We know filtering isn’t a complete solution to whatever profanity problem one has. We just want to know what the decent libraries are.
Paul D. Waite
@Paul: Are you including leoluk in "we"? Any "decent library" is going to need to perform lexical analysis, Bayesian heuristics or the like to discern different uses--not just run a regex. This code is cute, but isn't much more of a real-world solution than the bork below.
Glenn Maynard
@Glenn: I wouldn’t dream of speaking for the good fellow. And not necessarily — because computers don’t understand English, the library is not going to be able to do the entire job itself, it’s going to need human help. So running a regex may turn out to be the right balance between power and comprehensible code. Hence I say “good library” and “decent library”, not “magical perfect library”.
Paul D. Waite
@Paul: An approach that only searches for words without attempting to discern the context works fine for a small subset of language, but leaves a huge quantity of language undetectable. If blocking Carlin's list is all you want to do (to check a box on a feature requirement) then that's okay--but I think there's a significant area of practical analysis beyond that which can be done to make it something closer to practically useful. (@Brian's suggestion may be one, but I haven't tried it and they don't offer a public online demo. "Ask us for a demo" is never a promising sign.)
Glenn Maynard
@Glenn: “If blocking Carlin's list is all you want to do (to check a box on a feature requirement)” — who said I want to do either of those things? You’re right that there is a lot of potential for doing something more useful than just regexing for words, but you haven’t offered any of that in an answer yet.
Paul D. Waite
I pointed out that *this solution* is not very practically useful, and I did so because it seemed obvious that you were looking for something more than trivial word matching. I'm starting to regret wasting my time.
Glenn Maynard
+2  A: 

You can use Clean Speak from Inversoft to filter text from Python via a web service. Clean Speak is deployable software, so you can install it on your own servers and won't have to worry about network hops or failures.
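Calling such a locally deployed service from Python would look roughly like the sketch below. The endpoint URL, payload shape, and response format here are hypothetical placeholders, not Clean Speak's documented API--check Inversoft's documentation for the real contract:

```python
import json
import urllib.request

# Hypothetical endpoint for a locally deployed filter service --
# not Clean Speak's documented URL.
FILTER_ENDPOINT = "http://localhost:8001/filter"

def build_filter_request(text):
    """Build a JSON POST request for the (hypothetical) filter endpoint."""
    payload = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        FILTER_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def filter_text(text, timeout=5):
    """Send the text to the service and return its JSON response."""
    with urllib.request.urlopen(build_filter_request(text), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping the request-building separate from the network call makes the payload easy to unit-test without a running server.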

Brian Pontarelli
And it is difficult to work around it.
leoluk
I think you mean that network failures and latency are difficult to work around, which is true.
Brian Pontarelli
A: 

It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:

import re
def remove_profanity(s):
    def repl(word):
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])

print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))
Glenn Maynard
+2  A: 

You could probably combine http://spambayes.sourceforge.net/ and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt.
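spambayes' own API aside, the idea can be sketched with a minimal naive Bayes token scorer. Everything below--the class, the training samples, the thresholds--is illustrative rather than spambayes' actual interface; the CMU word list would give you seed vocabulary for the "offensive" class:

```python
import math
import re
from collections import Counter

class NaiveBayesProfanityScorer:
    """Token-level naive Bayes in the spirit of spambayes (illustrative only).
    Train it on messages you have hand-labelled as 'clean' or 'offensive'."""

    def __init__(self):
        self.counts = {"clean": Counter(), "offensive": Counter()}
        self.totals = {"clean": 0, "offensive": 0}

    def _tokens(self, text):
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        for tok in self._tokens(text):
            self.counts[label][tok] += 1
            self.totals[label] += 1

    def score(self, text):
        """P(offensive | text) under naive Bayes with Laplace smoothing."""
        vocab = len(set(self.counts["clean"]) | set(self.counts["offensive"]))
        log_odds = 0.0  # uniform prior over the two classes
        for tok in self._tokens(text):
            p_off = (self.counts["offensive"][tok] + 1) / (self.totals["offensive"] + vocab)
            p_clean = (self.counts["clean"][tok] + 1) / (self.totals["clean"] + vocab)
            log_odds += math.log(p_off / p_clean)
        return 1 / (1 + math.exp(-log_odds))

scorer = NaiveBayesProfanityScorer()
scorer.train("you utter badword", "offensive")   # toy training data
scorer.train("have a nice day", "clean")

print(scorer.score("badword to you"))  # well above 0.5
print(scorer.score("a nice day"))      # well below 0.5
```

Unlike a pure regex filter, a scorer like this gives a ranked queue for a human moderator rather than a hard block.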

Meher
I love that list.
Paul D. Waite
Oh yes. Africa, Allah and heterosexual. (Was that list collected by a white gay christian?)
zoul
@zoul: shut up, you big he--FILTERED--al.
Paul D. Waite
+1  A: 

Here's my Rather Lazy Attempt at this... my idea was to use difflib. Of course this isn't the entire program, but it should get you started :)

import difflib, re

with open("words.txt") as f:
    badwords = [line.strip() for line in f]

with open("text.txt") as f:
    body = f.read()

p = re.compile(r"\w+")

for word in set(p.findall(body)):
    # tune the cutoff according to how strict you want the filter to be
    lst = difflib.get_close_matches(word, badwords, n=1, cutoff=0.9)
    if lst:
        body = body.replace(word, "!@#&@%^tutsifruitsy$%&$")

print(body)

The tutsifruitsy is a reference to "The Late Late Show with Craig Ferguson" :)
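The cutoff is a threshold on difflib.SequenceMatcher's ratio, so word length matters; a quick experiment (my own toy words, not from the answer above) shows how:

```python
import difflib

# A one-letter change in a long word keeps the ratio at 18/19 ~ 0.947,
# so it still clears a strict 0.9 cutoff...
print(difflib.get_close_matches("profanitty", ["profanity"], n=1, cutoff=0.9))
# -> ['profanity']

# ...but the same single substitution in a short word drops the
# ratio to 4/6 ~ 0.667, well under the cutoff.
print(difflib.get_close_matches("b4d", ["bad"], n=1, cutoff=0.9))
# -> []
```

So a single fixed cutoff will be lenient on long words and strict on short ones; lowering it to catch "b4d" would also start matching unrelated short words.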

st0le