I have comments enabled on my site and I require users to enter at least 30 characters to publish a comment (just to get some value, because they usually submitted only "I like it"). But some users now use a simple technique to get around this and enter, e.g.:

"I like it. asdsdf dfdsfsdf tt erretrt re"

As you can see, the rest of the text is nonsense. Is there a way (an algorithm) to filter these comments out in PHP?

+1  A: 

You can use a naive Bayesian filter for this: http://www.paulgraham.com/better.html

There are probably existing libraries for this kind of thing. Check out SpamAssassin.
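
For illustration only, here is a minimal sketch of what a word-based naive Bayes classifier could look like in PHP. The class name TinyBayes, the labels 'ok' and 'junk', and the two training strings are all made up for this example; a usable filter would need many hand-labelled comments, and a real library is probably the better choice. It assumes PHP 7.4 or later (typed properties).

```php
<?php
// Hypothetical minimal multinomial naive Bayes classifier (illustrative, not a library API).
final class TinyBayes
{
    /** @var array<string, array<string, int>> token counts per label */
    private array $tokenCounts = [];
    /** @var array<string, int> number of training texts per label */
    private array $docCounts = [];
    /** @var array<string, bool> set of every token seen during training */
    private array $vocabulary = [];

    /** Lowercases the text and splits it into runs of the letters a-z. */
    private function tokenize(string $text): array
    {
        preg_match_all('/[a-z]+/', strtolower($text), $matches);
        return $matches[0];
    }

    /** Adds one labelled example to the model. */
    public function train(string $label, string $text): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($this->tokenize($text) as $token) {
            $this->tokenCounts[$label][$token] = ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    /** Returns the label with the highest log-probability ('' if untrained). */
    public function classify(string $text): string
    {
        $tokens = $this->tokenize($text);
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $best = '';
        $bestScore = -INF;
        foreach ($this->docCounts as $label => $docs) {
            $counts = $this->tokenCounts[$label] ?? [];
            $labelTokens = array_sum($counts);
            // Log prior plus add-one (Laplace) smoothed log likelihood of each token.
            $score = log($docs / $totalDocs);
            foreach ($tokens as $token) {
                $score += log((($counts[$token] ?? 0) + 1) / ($labelTokens + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = (string) $label;
            }
        }
        return $best;
    }
}

// Toy training data, invented for this example.
$filter = new TinyBayes();
$filter->train('ok', 'I like it because the photos are sharp and the colors are great');
$filter->train('junk', 'asdsdf dfdsfsdf tt erretrt re');
$label = $filter->classify('the colors are great');
```

One caveat with this approach: freshly typed gibberish consists mostly of tokens the model has never seen, so a word-level filter has little evidence to work with until it has been trained on a lot of examples.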

Joshua Smith
A: 

Personally, I would say there's not much you can do about it. Even if you had a dictionary and parser, what if I were to leave a comment: "I like it. As do I like your car." Depending on what they're leaving a comment for, that could be complete nonsense. The best I can suggest is to make each comment editable so that you, a mod, or whoever can edit it. Sorry that this isn't of any help.

I had this same issue when trying to create password restrictions. Words couldn't be used, so we needed to use a dictionary, but there is never a comprehensive dictionary. And the biggest thing was eliminating l33t speak. :)

XstreamINsanity
+15  A: 

Get a dictionary of English words from the net. Check that a certain percentage (maybe 50%? maybe 70%?) of the words in the post are in the dictionary. You can't require 100%, because names and technical jargon will not be found.
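
A rough sketch of that percentage check, assuming the word list has already been loaded into an array keyed by lowercase word. The function name dictionaryWordRatio, the seven-word stand-in dictionary, and the 0.5 cutoff are all placeholders chosen for the example.

```php
<?php
/**
 * Returns the fraction (0.0 to 1.0) of words in $comment found in $dictionary.
 * $dictionary is used as a set: keys are lowercase words, values are ignored.
 */
function dictionaryWordRatio(string $comment, array $dictionary): float
{
    // Treat runs of letters (and apostrophes, for words like "don't") as words.
    preg_match_all("/[a-z']+/", strtolower($comment), $matches);
    $words = $matches[0];
    if ($words === []) {
        return 0.0; // nothing that looks like a word
    }
    $known = 0;
    foreach ($words as $word) {
        if (isset($dictionary[$word])) {
            $known++;
        }
    }
    return $known / count($words);
}

// Tiny stand-in word list; a real one would be read from a downloaded file.
$dictionary = array_flip(['i', 'like', 'it', 'the', 'photos', 'are', 'great']);

$ratio = dictionaryWordRatio('I like it. asdsdf dfdsfsdf tt erretrt re', $dictionary);
$looksLikeNonsense = $ratio < 0.5; // the cutoff is a guess and needs tuning
```

Keying the array by word keeps each lookup cheap even with a dictionary of several hundred thousand entries.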

Users will get around this by entering:
I like it ....................................................
So then you add logic to strip out punctuation.
Then users will get around it with:
I like it. the the the the the the the the
Then you will need to parse it for proper English grammar.
Then no one will be able to post on your site because it has too many rules.

Better suggestion: Add comment moderation. Dumb posts get downvoted and go away. Good posts stay.

bwawok
I was thinking while reading the answer, "Really? All that just to validate a comment. Someone needs to re-think the problem...". Then I read the last line. +1...
ircmaxell
A: 

Unfortunately not. Your best bet is to modify something like this: Get Spelling Corrections From Google. When a message is close to the 80-character limit, you could look up each word individually and, if it doesn't have a direct hit, reject the input.

Troy Knapp
The the the The the the The the the The the the The the the The the the The the the The the the The the the The the the The the the The the the The the the will pass that.
bwawok
You're right, but the question was about nonsensical text, which I took to mean words that are misspelled or not in the English vocabulary. Obviously, human moderation is best, as you suggested, but failing that, this is the next best solution (definitely the simplest)... she could always implement a learning artificial neural net solution, which would be awesome, but much like killing an ant with an atom bomb.
Troy Knapp
A: 

I'd do a simple check on consecutive consonants or vowels. If there are more than four of either in a row, then there is a high probability of nonsense. Furthermore, check for more than two repetitions of the same character. When looking at some nonsense text, I'm sure you'll find some pragmatic recipes ;-)
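
One way these checks might be written with regular expressions. This is only a sketch: the function names are invented, "y" is arbitrarily counted as a vowel, and "more than two repetitions" is read here as the same letter three or more times in a row.

```php
<?php
/** Returns true if a single word trips one of the letter-run heuristics. */
function looksLikeKeyboardMash(string $word): bool
{
    $word = strtolower($word);
    // More than four consonants in a row (five or more).
    if (preg_match('/[bcdfghjklmnpqrstvwxz]{5,}/', $word) === 1) {
        return true;
    }
    // More than four vowels in a row; "y" is treated as a vowel here.
    if (preg_match('/[aeiouy]{5,}/', $word) === 1) {
        return true;
    }
    // The same letter three or more times in a row.
    if (preg_match('/([a-z])\1{2,}/', $word) === 1) {
        return true;
    }
    return false;
}

/** Counts how many words in a comment trip the heuristics. */
function countSuspiciousWords(string $comment): int
{
    preg_match_all('/[a-z]+/i', $comment, $matches);
    return count(array_filter($matches[0], 'looksLikeKeyboardMash'));
}

$suspicious = countSuspiciousWords('I like it. asdsdf dfdsfsdf tt erretrt re');
```

Real words such as "strengths" also contain five consonants in a row, so the count is probably better used as one signal among several than as a hard rejection rule.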

Regards

rbo

rubber boots
+2  A: 

Check out the Akismet PHP5 class.

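// Akismet needs an API key and the URL of the site being protected.
$WordPressAPIKey = 'KEYHERE';
$MyBlogURL = 'http://www.example.com/blog/';

$akismet = new Akismet($MyBlogURL, $WordPressAPIKey);

// Describe the submitted comment; $name, $email, $url and $comment come from your form.
$akismet->setCommentAuthor($name);
$akismet->setCommentAuthorEmail($email);
$akismet->setCommentAuthorURL($url);
$akismet->setCommentContent($comment);
$akismet->setPermalink('http://www.example.com/blog/alex/someurl/');

if ($akismet->isCommentSpam()) {
    // Flagged as spam: hold the comment for review instead of publishing it.
} else {
    // Not flagged: publish the comment as usual.
}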
simplemotives