The ways I can think of are:

  1. Measure the time between actions (see the sketch after this list).
  2. Compare the posts' content (flag posts that are too similar to each other) or, better yet, compare only the posted links.
  3. Check the distribution of activity over the period the user is active (if the user posts, say, once every hour for a whole week, then we have either a superman or a bot here).
  4. Expect some special activity: on Stack Overflow, for example, I would expect users to click their user name link (top middle) to see their new answers, comments, questions, etc.
  5. (added by chakrit) The number of links in a post.
  6. Not a heuristic: use some async JS for user login. (It just makes life a bit harder for the bot programmer.)
  7. (added by Alekc) Not a heuristic: User-Agent values.
  8. And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as spam; enough spam votes means this is a spam user. (Calculating how many votes is enough is the hard part.)
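
A minimal sketch of heuristics 1 and 3, assuming you have a sorted list of timestamps for each user's actions; the thresholds and the function name are illustrative, not a recommendation:

    from datetime import timedelta
    from statistics import pstdev

    MIN_INTERVAL = timedelta(seconds=10)   # actions closer together than this look automated
    REGULARITY_THRESHOLD = 5.0             # std dev (seconds) below which the posting rhythm looks machine-like

    def looks_automated(timestamps):
        """timestamps: sorted list of datetime objects for one user's actions."""
        if len(timestamps) < 3:
            return False
        gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
        too_fast = any(g < MIN_INTERVAL.total_seconds() for g in gaps)
        # A human posting "once every hour for a week" still shows jitter;
        # a near-constant interval over many posts is suspicious.
        too_regular = len(gaps) >= 10 and pstdev(gaps) < REGULARITY_THRESHOLD
        return too_fast or too_regular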

Any more ideas?

+3  A: 
  • The number of links in a post.

I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.

And most of the spam comments on my blog contain 10+ links.

Speaking of which... you just might want to check out the Akismet API itself; it's extremely effective.
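
For reference, a rough sketch of what a comment-check call against the Akismet REST API looks like; the API key, blog URL, and parameter values here are placeholders, so check the current Akismet documentation for the exact field list:

    import requests

    AKISMET_KEY = "your-api-key"   # placeholder
    ENDPOINT = "https://%s.rest.akismet.com/1.1/comment-check" % AKISMET_KEY

    def is_spam(blog_url, user_ip, user_agent, content):
        resp = requests.post(ENDPOINT, data={
            "blog": blog_url,
            "user_ip": user_ip,
            "user_agent": user_agent,
            "comment_type": "comment",
            "comment_content": content,
        })
        # Akismet replies with the literal string "true" (spam) or "false" (ham).
        return resp.text.strip() == "true"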

chakrit
+1. Links in a post is a pretty good one. Again, you could cross-reference those with a blacklist from, say, Spamhaus.org
Dead account
+1  A: 

How about a search for spam-related keywords in the post body?

Not a heuristic but an effective approach: you can also keep up to date with the stats published by StopForumSpam using their APIs.
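
A naive version of the keyword search might look like this; the keyword list and the threshold are made up purely for illustration and would need tuning:

    SPAM_KEYWORDS = {"viagra", "casino", "payday loan", "replica watches"}   # illustrative list

    def keyword_spam_score(body):
        text = body.lower()
        hits = sum(text.count(kw) for kw in SPAM_KEYWORDS)
        words = max(len(text.split()), 1)
        return hits / words   # flag the post if, say, the score exceeds 0.05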

dirkgently
+1  A: 

Time between page visits is a common one, I believe.

I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.

You might want to check whether they've come from an IP address on a spam blacklist (see http://www.spamhaus.org/).
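
Checking an IP against a DNS blacklist such as Spamhaus ZEN is usually done by reversing the octets and doing a DNS lookup; a minimal sketch (IPv4 only, error handling kept deliberately simple):

    import socket

    def is_listed_in_spamhaus(ip):
        """Return True if the IPv4 address appears in the Spamhaus ZEN blocklist."""
        reversed_ip = ".".join(reversed(ip.split(".")))
        query = "%s.zen.spamhaus.org" % reversed_ip
        try:
            socket.gethostbyname(query)   # any 127.0.0.x answer means "listed"
            return True
        except socket.gaierror:
            return False                  # NXDOMAIN: not listed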

Dead account
Spamhaus looks promising :-) ... but I think I've had quite a bad experience with blacklisting and proxies... :-(
chakrit
A: 

I have some doubts about point 4; anyway, I would also add the User-Agent. It's pretty easy to fake, but in my experience about 90% of bots report Perl as their UA.
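
A trivial sketch of that check; the substring list is illustrative only (libwww-perl is the classic Perl bot signature) and, as noted, a determined bot can simply lie:

    SUSPECT_UA_SUBSTRINGS = ("libwww-perl", "lwp::", "curl", "wget")   # illustrative list

    def suspicious_user_agent(user_agent):
        ua = (user_agent or "").lower()
        # An empty UA or one advertising a script/library is a cheap first filter.
        return ua == "" or any(s in ua for s in SUSPECT_UA_SUBSTRINGS)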

Alekc
4. This is a heuristic, and the hardest one to implement, but you can gain much more from such a system than just a spam filter (a system that tracks user behavior on your site).
Itay Moav
Yep, I meant that it's pretty hard to track down behavior patterns. For example, here on Stack Overflow one can just navigate around the questions without ever clicking on their profile.
Alekc
+1  A: 

There is another answer that suggests using Akismet for detecting spam, which I completely endorse.

However, they are not the only player on the block.

There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It's safe to say they take pretty much the same approach as Akismet.

You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the IP address of the user, and if it is a known malicious IP, it will tell you (harvester or something like that).

Finally, you can check out LinkSleeve, which claims to approach comment spam in a different way. Basically, it checks the links posted in comments and makes a determination based on where those links point.

casperOne
Honeypot filters rather too aggressively. I've been locked out of my own website once because my ISP's proxy address was included in the list... a total bummer for me :-( ... but +1 anyway :-)
chakrit
A: 

I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords. If the content is too rich in keywords, suspect it of being spam.

Eric
+3  A: 

I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent one. Using the C# browser control to create your bot would pretty much render 6 useless. From what I've seen of that type of software, that's a pretty common approach.

Validating on the user agent is pretty much useless too; all of the blog spam I used to get was from bots appearing to be valid web browsers.

I used to get a lot of blog spam. I would literally be deleting hundreds of comments a day. I made use of reCAPTCHA and now I might get one a month.

If you really want to build something like this, I would attempt the following:

User starts off with no ability to post a URL.

After X posts have been analyzed in relation to the other posts in the thread, give them access to post URLs.

The user's activity on the site, their post quality, and whatever other factors you deem necessary form a reputation for that user's IP.

Then, based on the reputation of that IP and of the other IPs on the same subnet, you can make whatever other decisions you want.

That was just the first thing that came to mind. Hope it helps.
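
A bare-bones sketch of the URL-gating part, assuming you already track how many of a user's posts have been vetted; the threshold and the regex are placeholders:

    import re

    URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
    MIN_VETTED_POSTS = 5   # illustrative threshold

    def may_post_links(vetted_post_count):
        return vetted_post_count >= MIN_VETTED_POSTS

    def accept_post(body, vetted_post_count):
        if URL_RE.search(body) and not may_post_links(vetted_post_count):
            return False   # reject, strip the links, or hold for moderation
        return True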

William
+1 The idea of restricting new users from posting URLs is a very good one for my personal case. Not sure how general it is.
Itay Moav
+1  A: 

Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
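
One way to turn those reports into an automatic action, assuming one vote per registered user; the threshold is arbitrary and, as the comment below notes, you probably still want manual review on top:

    from dataclasses import dataclass, field

    SPAM_VOTE_THRESHOLD = 5   # illustrative; tune against your own traffic

    @dataclass
    class Post:
        body: str
        reporters: set = field(default_factory=set)
        hidden: bool = False

    def handle_spam_report(post, reporter_id):
        post.reporters.add(reporter_id)   # a set, so each registered user counts once
        if len(post.reporters) >= SPAM_VOTE_THRESHOLD:
            post.hidden = True            # hide pending admin review rather than deleting outright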

Will Hartung
Note that this can be an abuse vector, most obviously if you let a user vote many times or allow votes from anonymous users - malicious users can use it to silence people they disagree with. To prevent this, consider manual review (possibly by trusted users) or at least an appeals process.
aem