+4  Q: 

"bad words" filter

Not very technical, but... I have to implement a bad-words filter in a new site we are developing, so I need a "good" bad-words list to feed my DB with. Any hints or directions? Looking around with Google I found this one, and it's a start, but nothing more.

Yes, I know that this kind of filter is easily evaded... but the client's will is the client's will! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.

Thanks for any help.

+1  A: 

I'd suggest a whitelist of accepted words instead of a blacklist.

:) :) :) :)

Here is a block list:

http://support.discusware.com/center/resources/tips/dirtywords.zip

FlySwat
+2  A: 

I don't know of a list, but this would make for a great Akismet-like web service.

John Sheehan
+17  A: 

Beware of clbuttic mistakes.

AgentConundrum
I opened this thread with the intention of adding this same response...and then I realized that I gave you an upvote for it long ago :)
Ed Swangren
+9  A: 

I didn't see any language specified, but you can use this for PHP. It will generate a regex for each inserted word so that even intentional misspellings (e.g. @ss, i3itch) will also be caught.
I may even have an extensive set of words, but I won't post that here; if you want it, email me [email protected]

<?php

/**
 * @author [email protected]
 **/

if (($_GET['act'] ?? '') === 'do')
 {
    // Map each letter to a regex fragment matching its common look-alikes.
    // Multi-character substitutes (I3, ph, vv) need alternation; inside a
    // character class they would be read as separate single characters, and
    // spaces inside a class would make it match a space as well.
    $replace = [
        'a' => '[aA@]',
        'b' => '(?:[bB]|[Ili]3)',
        'c' => '[cC(kK]',
        'd' => '[dD]',
        'e' => '[eE3]',
        'f' => '(?:[fF]|[pP][hH])',
        'g' => '[gG6]',
        'h' => '[hH]',
        'i' => '[iIl!1]',
        'j' => '[jJ]',
        'k' => '[cC(kK]',
        'l' => '[lL1!i]',
        'm' => '[mM]',
        'n' => '[nN]',
        'o' => '[oO0]',
        'p' => '[pP]',
        'q' => '[qQ9]',
        'r' => '[rR]',
        's' => '[sS$5]',
        't' => '[tT7]',
        'u' => '[uUvV]',
        'v' => '[vVuU]',
        'w' => '(?:[wW]|[vV]{2})',
        'x' => '[xX]',
        'y' => '[yY]',
        'z' => '[zZ2]',
    ];
    $out = '';
    foreach (str_split(strtolower($_POST['word'] ?? '')) as $ch)
     {
      // Letters become look-alike fragments; anything else is escaped literally.
      $out .= isset($replace[$ch]) ? $replace[$ch] : preg_quote($ch, '/');
     }
    //$out = "/" . $out . "/";
    echo htmlspecialchars($out);
 }

if (($_GET['act'] ?? '') === 'list')
 {
    // The mysql_* functions were removed in PHP 7; PDO is used here instead.
    $pdo = new PDO('mysql:host=localhost;dbname=peoples', 'username', 'password');
    foreach ($pdo->query('SELECT word FROM filters') as $row)
     {
     echo htmlspecialchars($row['word']) . "<br />";
     }
     echo '<hr>';
 }
?>
<html>
    <head>
     <title>RegEx Generator</title>
    </head>
    <body>
     <form action='badword.php?act=do' method='post'>
      Word: <input type='text' name='word' /><br />
      <input type='submit' value='Generate' />
     </form>
     <a href="badword.php?act=list">List Words</a>
    </body>
</html>
Unkwntech
On't-day orget-day ig-pay atin-lay. Urse-cay ords-way are-ar ill-st ite-quay eadable-ray. (former owner of the AOL nick Itshay).
plinth
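
The generator above only prints a pattern; it does not show the pattern being applied. Below is a minimal sketch of that second step, written in Python for brevity (the idea ports directly to PHP or C#). The `LOOKALIKES` table, `build_pattern`, and `mask` are hypothetical names introduced for illustration, and the table is a small subset rather than the full alphabet from the answer.

```python
import re

# Hypothetical look-alike table (a small subset, for illustration only).
# Multi-character substitutes such as "I3" use alternation, not a character class.
LOOKALIKES = {
    "a": "[a@4]",
    "b": "(?:b|[il1]3)",
    "e": "[e3]",
    "i": "[i1!l]",
    "o": "[o0]",
    "s": "[s$5]",
    "t": "[t7]",
}


def build_pattern(word: str) -> re.Pattern:
    """Compile a case-insensitive regex for `word` and simple look-alike spellings."""
    parts = [LOOKALIKES.get(ch, re.escape(ch)) for ch in word.lower()]
    return re.compile("".join(parts), re.IGNORECASE)


def mask(text: str, words: list[str]) -> str:
    """Replace every match of every listed word with asterisks of the same length."""
    for word in words:
        text = build_pattern(word).sub(lambda m: "*" * len(m.group()), text)
    return text
```

For example, `mask("what a b@d day", ["bad"])` stars out `b@d`. Note that this sketch matches substrings, so on its own it is still prone to the "clbuttic" problem mentioned elsewhere in this thread.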
A: 

You could always convince the client to have a session of users just constantly posting expletives and make an easy solution to add them to the system. It is a lot of work but it will probably be more representative of the community.

Ross
+2  A: 

I don't believe in blacklists. I looked at the blacklist linked to in the question, and for one thing it lists "gay" as a bad word. You obviously need to know the context of things. Unless the website is for a conservative religious community, in which case many words don't need a context to be considered offensive, I'd suggest using an "Offensive?" link as is common in forums. This should be effective and produce far fewer false positives.

wilhelmtell
+1  A: 

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

Kibbee
A: 

@mutable: I agree with you, the context is important; actually I'm going through a complete revision of the dictionary. Thanks

@Unkwntech: Thanks for the code snippet. I'm developing the whole thing in C# / SQL Server (actually the logic for the word matching is implemented in stored procedures), but the idea is good, thanks a lot.

I'll tell you when the site is up and running :-)

ila
+1  A: 

+1 on the Clbuttic mistake. I think it is important for "bad word" filters to check for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we won't end up with words like clbuttic, clbuttes, buttert, buttess, etc.

Jon Limjap
And don't block the town of Scunthorpe.
TRiG
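
The whole-word idea above can be sketched with regex word boundaries (`\b`), which also handle punctuation and start/end of string, not just spaces. This is a minimal illustration in Python under assumed names (`censor` and its parameters are hypothetical), not a complete filter.

```python
import re


def censor(text: str, bad_words: list[str], replacement: str = "***") -> str:
    """Replace whole-word occurrences only; substrings inside longer words are left alone."""
    if not bad_words:
        return text
    # Escape each word so regex metacharacters in the list are taken literally.
    alternation = "|".join(re.escape(w) for w in bad_words)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern.sub(replacement, text)
```

With this, `censor("a classic ass", ["ass"])` leaves "classic" intact and replaces only the standalone word. It does nothing about obfuscated spellings, which is a separate problem.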
A: 

I'm doing a bad-word filter with the code below:

<?
function replace_bad_word($match) {
    $filtered_word = $match[2];
    $replacement = $match[1].substr($filtered_word, 0, 1);
    for ($i = 1; $i < strlen($filtered_word) - 1; $i++) $replacement .= '*';
    $replacement .= substr($filtered_word, -1).$match[3];
    return $replacement;
}
if (strpos($filter_comment, ',') !== false) {
    $regex = explode(',', $filter_comment);
}
else {
    $regex[] = trim($filter_comment);
}
//this can be placed in database like MySQL etc
$bad_word = 'fuck, shit';
if (strpos($bad_word, ',') !== false) {
    $regex = explode(',', $bad_word);
}
else {
    $regex[] = trim($bad_word);
}

$regex_array = array();

foreach ($regex as $word) {
    $word = trim($word);
    $regex = '';
    //split the word into its character
    for ($i = 0; $i

As you see, the second word in a sequence (fuccckkk) is not affected by the filter. I'm trying hard to solve this problem, but changing $regex[] to $regex_array[] = '#(^|\s+|[^\w+])(('.$regex.')+)([^\w+]|\s+|$)#i' (a recursive regex) does nothing. Can anyone help me, please? Please email me at BandenX[at]gmail.com
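
One possible approach to the stretched-spelling problem described above, sketched in Python rather than PHP: let every letter repeat (`d+a+r+n+`) and use lookarounds instead of capturing the surrounding separators, so the separators are not consumed and adjacent words can each match. The function names and the mild sample word are assumptions for illustration; this is not a repair of the truncated code above.

```python
import re


def stretchy_pattern(word: str) -> re.Pattern:
    """Match `word` as a whole word, allowing each letter to repeat one or more times."""
    body = "".join(re.escape(ch) + "+" for ch in word)
    # Lookarounds assert "no word character here" without consuming anything.
    return re.compile(rf"(?<!\w){body}(?!\w)", re.IGNORECASE)


def mask_stretched(text: str, words: list[str]) -> str:
    """Keep the first and last character of each match and star out the rest."""
    def star(m: re.Match) -> str:
        s = m.group()
        return s if len(s) < 3 else s[0] + "*" * (len(s) - 2) + s[-1]

    for word in words:
        text = stretchy_pattern(word).sub(star, text)
    return text
```

For instance, `mask_stretched("darn daaarrrn", ["darn"])` masks both words, while "darning" is left alone because a word character follows the match.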

A: 

It has occurred to me that this problem has a good shot at producing the largest regular expression ever, although this one puts on a good show.

Whatever
A: 

I recently found this free online bad-word filtering web service. They allow for WCF and form posts in JSON. It works really well, and I don't have to update the list since theirs is constantly updated. They have both free and affordable premium services available.

It's worth a look.

http://www.thefilthylist.com

marzolo
A: 

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

Richard
A: 

Wikipedia's ClueBot has a bad-word filter; read its source.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list

SHiNKiROU