views: 439
answers: 6
Like http://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.

(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)

+1  A: 

Profanity? What the f***'s that? ;-)

It will still take a couple of years before a computer can really recognize swearing and cursing, and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."

Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:

"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."

(that was from Winston Churchill, btw.)

Aaron Digulla
Exactly. Profanity filters are pointless, at least until natural language parsers are much better.
delnan
Who downvotes this, and why?
delnan
@delnan: I guess because I asked what a good profanity filter library was, not whether I should use one at all. Suggestions like this can be better as comments, although they can be valid as answers too.
Paul D. Waite
@Aaron: yeah, I’m not planning to have the machine deal with profanity on its own. But rather than making a human being look at every damn thing on the site, it’d be nice if the machine could offer suggestions of what’s worth taking a look at. (That’s not a criticism of your answer, as I didn’t provide any explanation of what I was going to use the filter for.)
Paul D. Waite
@Aaron: oh, and I reckon it’ll be a lot longer than a couple of years before computers reliably understand English. And that the subset of people who care about the swears will not have gone away.
Paul D. Waite
I downvote this - I categorically disagree with the concept of profanity being neutral.
Paul Nathan
Personally, as a moderator, I'd let that one through on account of sheer quality.
intuited
+15  A: 

I didn't find any Python profanity library, so I made one myself.

Parameters


filterlist

A list of regular expressions that match forbidden words. Do not include \b; it is inserted automatically depending on inside_words.

Example: [r'bad', r'un\w+']

ignore_case

Default: True

Self-explanatory.

replacements

Default: "$@%-?!"

A string of characters from which the replacement strings are randomly generated.

Examples: "%&$?!" or "-" etc.

complete

Default: True

Controls whether the entire word is replaced, or whether its first and last characters are kept.

inside_words

Default: False

Controls whether words are also matched inside other words. When disabled, \b word boundaries are added around the pattern so that only whole words match.

Module source


(examples at the end)

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        else:
            return value[0]+self._make_clean_word(len(value)-2)+value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    f = ProfanitiesFilter([r'bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."

    print(f.clean(example))
    # Prints "I am doing --- ------ badlike things."

    f.inside_words = True
    print(f.clean(example))
    # Prints "I am doing --- ------ ---like things."

    f.complete = False
    print(f.clean(example))
    # Prints "I am doing b-d u----d b-dlike things."
leoluk
Now that’s an answer.
Paul D. Waite
Profanity isn't primarily about words, but usage; most words which can be used as "profanity" have perfectly "clean" uses, and it takes a lot more than a regex to distinguish them. (Never mind, of course, that anything like this will only prompt people to w*rk ar*und it.)
Glenn Maynard
(I think it's pretty neat that just putting asterisks in w*rds makes it look l*ke a *wear.)
Glenn Maynard
@Glenn: yes, we know. We know filtering isn’t a complete solution to whatever profanity problem one has. We just want to know what the decent libraries are.
Paul D. Waite
@Paul: Are you including leoluk in "we"? Any "decent library" is going to need to perform lexical analysis, Bayesian heuristics or the like to discern different uses--not just run a regex. This code is cute, but isn't much more of a real-world solution than the bork below.
Glenn Maynard
@Glenn: I wouldn’t dream of speaking for the good fellow. And not necessarily — because computers don’t understand English, the library is not going to be able to do the entire job itself, it’s going to need human help. So running a regex may turn out to be the right balance between power and comprehensible code. Hence I say “good library” and “decent library”, not “magical perfect library”.
Paul D. Waite
@Paul: An approach that only searches for words without attempting to discern the context works fine for a small subset of language, but leaves a huge quantity of language undetectable. If blocking Carlin's list is all you want to do (to check a box on a feature requirement) then that's okay--but I think there's a significant area of practical analysis beyond that which can be done to make it something closer to practically useful. (@Brian's suggestion may be one, but I haven't tried it and they don't offer a public online demo. "Ask us for a demo" is never a promising sign.)
Glenn Maynard
@Glenn: “If blocking Carlin's list is all you want to do (to check a box on a feature requirement)” — who said I want to do either of those things? You’re right that there is a lot of potential for doing something more useful than just regexing for words, but you haven’t offered any of that in an answer yet.
Paul D. Waite
I pointed out that *this solution* is not very practically useful, and I did so because it seemed obvious that you were looking for something more than trivial word matching. I'm starting to regret wasting my time.
Glenn Maynard
+2  A: 

You can use Clean Speak from Inversoft to filter text from Python via a web service. Clean Speak is deployable software, so you can install it on your own servers and won't have to worry about network hops or failures.
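Calling such a locally deployed service from Python would look roughly like the sketch below. The endpoint URL, payload shape, and response format here are hypothetical placeholders, not Clean Speak's documented API--check Inversoft's documentation for the real contract:

```python
import json
import urllib.request

# Hypothetical endpoint for a locally deployed filter service --
# not Clean Speak's documented URL.
FILTER_ENDPOINT = "http://localhost:8001/filter"

def build_filter_request(text):
    """Build a JSON POST request for the (hypothetical) filter endpoint."""
    payload = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        FILTER_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def filter_text(text, timeout=5):
    """Send the text to the service and return its JSON response."""
    with urllib.request.urlopen(build_filter_request(text), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping the request-building separate from the network call makes the payload easy to unit-test without a running server.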

Brian Pontarelli
And it is difficult to work around it.
leoluk
I think you mean that network failures and latency are difficult to work around, which is true.
Brian Pontarelli
A: 

It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:

import re
def remove_profanity(s):
    def repl(word):
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])

print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))
Glenn Maynard
+2  A: 

You could probably combine http://spambayes.sourceforge.net/ and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt.
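spambayes' own API aside, the idea can be sketched with a minimal naive Bayes token scorer. Everything below--the class, the training samples, the thresholds--is illustrative rather than spambayes' actual interface; the CMU word list would give you seed vocabulary for the "offensive" class:

```python
import math
import re
from collections import Counter

class NaiveBayesProfanityScorer:
    """Token-level naive Bayes in the spirit of spambayes (illustrative only).
    Train it on messages you have hand-labelled as 'clean' or 'offensive'."""

    def __init__(self):
        self.counts = {"clean": Counter(), "offensive": Counter()}
        self.totals = {"clean": 0, "offensive": 0}

    def _tokens(self, text):
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        for tok in self._tokens(text):
            self.counts[label][tok] += 1
            self.totals[label] += 1

    def score(self, text):
        """P(offensive | text) under naive Bayes with Laplace smoothing."""
        vocab = len(set(self.counts["clean"]) | set(self.counts["offensive"]))
        log_odds = 0.0  # uniform prior over the two classes
        for tok in self._tokens(text):
            p_off = (self.counts["offensive"][tok] + 1) / (self.totals["offensive"] + vocab)
            p_clean = (self.counts["clean"][tok] + 1) / (self.totals["clean"] + vocab)
            log_odds += math.log(p_off / p_clean)
        return 1 / (1 + math.exp(-log_odds))

scorer = NaiveBayesProfanityScorer()
scorer.train("you utter badword", "offensive")   # toy training data
scorer.train("have a nice day", "clean")

print(scorer.score("badword to you"))  # well above 0.5
print(scorer.score("a nice day"))      # well below 0.5
```

Unlike a pure regex filter, a scorer like this gives a ranked queue for a human moderator rather than a hard block.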

Meher
I love that list.
Paul D. Waite
Oh yes. Africa, Allah and heterosexual. (Was that list collected by a white gay christian?)
zoul
@zoul: shut up, you big he--FILTERED--al.
Paul D. Waite
+1  A: 

Here's my Rather Lazy Attempt at this... my idea was to use difflib. Of course this isn't the entire program, but it should get you started :)

import difflib, re

with open("words.txt") as f:
    badwords = [line.strip() for line in f]

with open("text.txt") as f:
    body = f.read()

p = re.compile(r"\w+")

for word in set(p.findall(body)):
    # tune the cutoff according to how strict you want the filter to be
    lst = difflib.get_close_matches(word, badwords, n=1, cutoff=0.9)
    if lst:
        body = body.replace(word, "!@#&@%^tutsifruitsy$%&$")

print(body)

The tutsifruitsy is a reference to "The Late Late Show with Craig Ferguson" :)
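The cutoff is a threshold on difflib.SequenceMatcher's ratio, so word length matters; a quick experiment (my own toy words, not from the answer above) shows how:

```python
import difflib

# A one-letter change in a long word keeps the ratio at 18/19 ~ 0.947,
# so it still clears a strict 0.9 cutoff...
print(difflib.get_close_matches("profanitty", ["profanity"], n=1, cutoff=0.9))
# -> ['profanity']

# ...but the same single substitution in a short word drops the
# ratio to 4/6 ~ 0.667, well under the cutoff.
print(difflib.get_close_matches("b4d", ["bad"], n=1, cutoff=0.9))
# -> []
```

So a single fixed cutoff will be lenient on long words and strict on short ones; lowering it to catch "b4d" would also start matching unrelated short words.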

st0le