Possible Duplicate:
How do you implement a good profanity filter?

I have a classifieds website, and when displaying a classified, users have the option of mailing a message to the poster of the classified.

I need to check this message for bad words and unseriousness before sending it.

Firstly, how can I check some text against a text file with PHP? The text is in Swedish, so there are three special characters as well... I mention this so that special characters won't be an issue when checking against the text file.

I could also use an array and do the checking in PHP. Would that be preferred? Maybe faster?

Secondly, do you have any tips on how to check a message against "unseriousness"?

Thanks

+1  A: 

Sounds like you need some sort of moderation to catch "unseriousness". Make the classifieds peer-moderated, so that users can flag inappropriate or spam messages. Ultimately, you would still need to moderate the site yourself or hire someone to moderate it.

Shiftbit
A: 

Found this code on this page:

function ReplaceBadWords($str, $bad_words, $replace_str){
    if (!is_array($bad_words)) { $bad_words = explode(',', $bad_words); }
    foreach ($bad_words as $word) {
        $word = trim($word);
        if ($word === '') { continue; }
        $_replace_str = $replace_str;
        // a one-character replacement is repeated to match the word's length
        if (strlen($replace_str) == 1) {
            $_replace_str = str_pad($_replace_str, strlen($word), $replace_str);
        }
        // preg_quote() escapes regex metacharacters in the word, so entries
        // like "a**hole" can't break or subvert the pattern
        $str = preg_replace('/'.preg_quote($word, '/').'/i', $_replace_str, $str);
    }
    return $str;
}


/*example-start*/

/*
** First example:
*/

// create some test "bla bla"
$str = <<<EOF
This is a test paragraph,
to test this bad-words function.
Some bad words: fuck shit.
Sorry for that ;-)
EOF;

// this string will be used to replace the
// bad words (single quotes, so "$*" is not
// parsed as a variable):
$replace_str = '@#$*!';

// create an array with words to replace:
$bad_words = array('shit', 'fuck');

// execute the function:
print ReplaceBadWords($str, $bad_words, $replace_str);
print "<hr/>\n";


/*
Another example:

 This tiny example shows two alternatives:

 1. You can use a string as the source of bad words
 2. If you give a "replace string" with the length of one letter,
    it will automatically be repeated to match the bad word's length
*/

print ReplaceBadWords($str, 'fuck,shit,paragraph', '*');

/*example-end*/
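On the first part of the question (checking against a text file): the word list can be kept in a UTF-8 text file, one word per line, and loaded into an array before calling ReplaceBadWords(). A minimal sketch; the filename badwords.txt is just a placeholder:

```php
<?php
// Load a bad-word list from a UTF-8 text file, one word per line.
// FILE_IGNORE_NEW_LINES strips the trailing newline from each entry,
// FILE_SKIP_EMPTY_LINES drops blank lines. Because file() is
// byte-oriented, Swedish å/ä/ö pass through unchanged as long as the
// file and the messages use the same encoding (e.g. both UTF-8).
function LoadBadWords($filename) {
    $words = @file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($words === false) {
        return array();  // file missing or unreadable
    }
    return array_map('trim', $words);
}

// Usage ('badwords.txt' is a placeholder path):
// $bad_words = LoadBadWords('badwords.txt');
// print ReplaceBadWords($message, $bad_words, '*');
```

As for file vs. array: the list ends up as a PHP array either way, so the question is only where it is stored; loading a small file once per request is not a performance concern.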

As for unseriousness, it will take human intervention to catch that accurately...

Garis Suero
Clbuttic idea...
Tim Pietzcker
+3  A: 

Word filters don't work. They don't stop anyone from posting "unseriously" (or sexually suggestive content or warez or x or y or z or ...), because that is mostly unrelated to particular combinations of letters. To truly catch all of them, you need semantic analysis, which means you need aware moderators. Word filters don't do semantic analysis; they ban words based on how the author of the list thinks they will commonly be used. They don't catch messages that violate certain rules concerning their content; they ban strings whose contents may or may not violate those rules. For example, just about every bad-word filter would break a message concerning tits -- Eeek, he said "tits"! -- Jeez, I mean the birds! What did you think? ;-) Any human reading this will know that I'm not spamming or spreading pr0n, but a bad-word filter would delete this post or replace "tits" with e.g. "breasts", making you wonder what I was talking about.

The only solution: moderators. Moderators who have lots of time and can ban people.

delnan
+1: When I worked for a big social networking site we had plenty of profanity filtering, but it was only to show our clients we were doing *something* in that regard. If they *really* wanted to keep their forums safe for the little old lady from Peoria, they had to pony up for the moderation package.
Robusto
+1  A: 

Firstly, how can I check some text against a text file with PHP? The text is in Swedish, so there are three special characters as well... I mention this so that special characters won't be an issue when checking against the text file.

This is the easy part: The only two character encodings commonly used for Swedish are UTF-8 and ISO-8859-1 (or the related ISO-8859-15 and windows-1252, but they use the same encoding for "å", "ä", and "ö"). And UTF-8 is easy to detect.
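As a sketch of that detection step (assuming the mbstring extension is available): mb_check_encoding() distinguishes valid UTF-8 from ISO-8859-1 bytes, because the single bytes ISO-8859-1 uses for "å", "ä", and "ö" are not valid UTF-8 sequences on their own:

```php
<?php
// Normalize incoming text to UTF-8. If the bytes are not valid UTF-8,
// assume ISO-8859-1, the other encoding commonly used for Swedish.
function ToUtf8($str) {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}
```

Pure-ASCII input is valid UTF-8 and passes through unchanged, which is harmless because ASCII, ISO-8859-1, and UTF-8 all agree on those bytes.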

Please answer the "bad words" part first... that shouldn't be so hard, right?

No, it's not easy at all.

Plus, some of the "bad words" depend on context. Are you going to reject phrases like "cock-a-doodle-doo", "Dick Nixon", and "pussy willow"?

Secondly, do you have any tips on how to check a message against "unseriousness"?

Like profanity filtering, this would be difficult to do automatically. You'll probably want to have a human moderator.

If you want a programmatic way to flag "unseriousness", perhaps you could use a heuristic like Bayesian spam filtering.
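A sketch of that heuristic idea: score a message by how many of its words occurred more often in previously flagged messages than in accepted ones. This is a crude word-frequency filter, not a real Bayesian classifier, and the training counts would have to come from moderator decisions:

```php
<?php
// Crude "unseriousness" score: the fraction of words in $message that
// were seen more often in flagged messages than in accepted ones.
// $flagged and $accepted map word => count, built from past moderator
// decisions (illustrative data only).
function UnseriousScore($message, $flagged, $accepted) {
    // split on anything that is not a letter; /u handles å, ä, ö
    preg_match_all('/\pL+/u', mb_strtolower($message, 'UTF-8'), $m);
    $words = $m[0];
    if (count($words) === 0) { return 0.0; }
    $hits = 0;
    foreach ($words as $w) {
        $f = isset($flagged[$w]) ? $flagged[$w] : 0;
        $a = isset($accepted[$w]) ? $accepted[$w] : 0;
        if ($f > $a) { $hits++; }
    }
    return $hits / count($words);
}
```

Messages scoring above some threshold would then be queued for human review rather than rejected outright.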

dan04
+2  A: 

This is not an answer, just a remark which, I hope, is useful here.

In my experience, all dictionary-based bad-word replacement algorithms fail, for several reasons:

  • People are always more intelligent than basic filters. If they want to write the word "fuck" and it is blocked, they may try "fùck", "f-u-c-k", "fUcK", etc. Blocking all those alternatives is extremely difficult.

  • There are too many possibilities. If you want to block "fuck", you may also need to block "fucking", "fuck'n", "fucker", "fuckers", etc. For just one word, you can probably imagine dozens of variants.

  • A bad word may be part of a good word. I can't think of a good example offhand, but I have seen, several times, a perfectly correct word cut in two by an "@#$*!" in the middle, just because some of its letters matched a word from the dictionary.
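The first two points can be partly (only partly) countered by normalizing text before matching: lower-casing, mapping some accented letters to their base letters, and stripping separators. This catches "fùck" and "f-u-c-k" but not creative misspellings, and it makes the third problem (bad words hiding inside good words) even worse. A sketch, assuming UTF-8 input; the accent map is deliberately tiny and illustrative:

```php
<?php
// Normalize a UTF-8 string before matching it against a word list:
// lower-case it, fold a few common accented letters to their base
// letters, and keep letters only. Swedish å, ä, ö are deliberately
// NOT folded: they are distinct letters, not accented a's and o's.
function NormalizeForFilter($str) {
    $str = mb_strtolower($str, 'UTF-8');
    $accents = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a',
        'è' => 'e', 'é' => 'e', 'ì' => 'i', 'í' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u',
    );
    $str = strtr($str, $accents);
    // strip separators such as '-', '.', and spaces: keep letters only
    return preg_replace('/\PL+/u', '', $str);
}

// NormalizeForFilter('f-u-c-k') and NormalizeForFilter('fÙck')
// both reduce to 'fuck' -- but new spellings keep evolving.
```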

So probably the best way is to let users enter whatever they like, and instead invest in a well-designed tracking and banning system with human moderators. An impolite person can always circumvent dictionary limitations, but will always end up banned by a moderator.

Gmail's spam filter is an example of a good filter (based partly on humans and on a huge number of submitted mails). But I doubt that developing a similar system for your project is a realistic solution in your case.

MainMa
+1 forget automatic filtering. Make it easy to moderate content instead.
Pekka
Examples of "bad words" that are part of perfectly acceptable, normal words... in this case, names of towns in England: Penistone, Arsenal, Scunthorpe... There were a lot of problems on AOL many years ago when it introduced such a filter, which censored a lot of discussions of English soccer.
Mark Baker
+7  A: 

There is no filter that can be made which will detect the following (and its many variants)*:

Friends
Uneducated by the
Computer cognoscenti will not
Know that profanity filters are

Yet another
Obviously ineffective tool of the
Unenlightened.

*If you don't get it, keep trying.

Robusto
+1 for excellent example.
delnan
Even a human won't always catch this one. http://www.snopes.com/photos/signs/headstone.asp
dan04