With text recognition improving and CAPTCHA-breakers farming otherwise unbreakable challenges out to Mechanical Turk workers, what's the next technology to keep scripts from spam-botting a site that relies on user input?
I am a fan of limiting signups by requiring a credit card or cell phone SMS verification (as Craigslist and Gmail do). These checks don't cost much (<$1) but can be highly effective in keeping spam accounts under control.
However, this is tricky on a site like SO, because one of the founding goals is to have minimal friction and allow anonymous users to contribute. I guess that's where the throttling and voting come into play.
I like the concept of an 'Invisible Captcha'. Phil Haack details one implementation here.
This banks on the fact that most bots, spiders, and crawlers don't run JavaScript. That, too, could change in the near future.
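For illustration, here's a minimal server-side sketch of the idea (the field and function names are my own invention, not Haack's): the page's script copies a token into a hidden field, so anything that doesn't execute the script fails the check.

```python
# Sketch of an invisible-CAPTCHA check. Bots that don't run JavaScript
# submit the form with the hidden field still empty and are rejected.
import hashlib
import hmac

SECRET = b"change-me"  # assumption: a per-site server secret

def expected_token(form_id: str) -> str:
    """Token the page's JavaScript must copy into the hidden field."""
    return hmac.new(SECRET, form_id.encode(), hashlib.sha256).hexdigest()[:16]

def render_form(form_id: str) -> str:
    token = expected_token(form_id)
    return f"""
    <form method="post" action="/comment">
      <textarea name="body"></textarea>
      <input type="hidden" name="js_token" id="js_token" value="">
      <script>document.getElementById('js_token').value = '{token}';</script>
      <input type="submit" value="Post">
    </form>"""

def is_probably_human(form_id: str, submitted_token: str) -> bool:
    """Server-side check on submission."""
    return hmac.compare_digest(submitted_token, expected_token(form_id))
```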
The most fundamental tool to keep people from spam-botting a user-input site is the rel="nofollow" attribute on links. Most comment-spammers are after Google juice rather than actually having their stuff seen, so nofollow removes the incentive.
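As a rough sketch of how a site might enforce this on user-submitted markup (regex-based, so only suitable for HTML that has already been sanitized):

```python
# Rewrite anchor tags in user-submitted HTML so every link carries
# rel="nofollow", removing the PageRank incentive for comment spam.
import re

def add_nofollow(html: str) -> str:
    def fix(match: re.Match) -> str:
        tag = match.group(0)
        if 'rel=' in tag:
            # Append nofollow to an existing rel attribute.
            return re.sub(r'rel="([^"]*)"', r'rel="\1 nofollow"', tag)
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r'<a\b[^>]*>', fix, html)

print(add_nofollow('<a href="http://spam.example">cheap pills</a>'))
# -> <a href="http://spam.example" rel="nofollow">cheap pills</a>
```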
For now, reputation systems are harder to beat. The community sites of the near future will need to rely on their higher-ranking members to remove the spam.
The trend is for spam to become ever less distinguishable from legitimate content, and for each new generation of mechanical filters to die of ineffectiveness, like overused antibiotics.
Even reputation systems will become useless as spammers start maintaining sock-puppet farms to groom their own high-ranking members, and when the community fights back, the spammers will simply treat the churn of sock-puppets as just another cost of doing business.
If you're going to build a site that takes user content, you'll either need to subscribe to the treadmill of never-ending CAPTCHA successors, or find a way to remove the incentive to spam your site in the first place.
The bar will keep being raised with problems that computers are bad at and humans are good at. Recognising emotions in a human face, for example, is something humans are particularly good at.
Another option could be along the lines of differentiating between disgusting and nice. It's totally subjective, but humans reliably recoil from rotten food, open wounds, poo, etc.
A negative Turing test. I have used this for over a year on WordPress, IP.Board, and MediaWiki sites and have absolutely zero spam. The only catch: you have to think of a question/answer combination that's neither too common (otherwise bots will adapt) nor too domain-specific (otherwise potential users might not know the answer).
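A minimal sketch of the mechanism (the questions here are placeholders; per the advice above, you'd pick ones tuned to your own audience):

```python
# Negative Turing test: a question trivial for humans but absent
# from bot dictionaries. Render one question with the form, then
# check the submitted answer before accepting the post.
import random

CHALLENGES = {
    "What colour is a stop sign?": {"red"},
    "How many legs does a spider have?": {"8", "eight"},
}

def pick_challenge() -> str:
    """Choose a question to render alongside the form."""
    return random.choice(list(CHALLENGES))

def check_answer(question: str, answer: str) -> bool:
    """Accept the submission only if the answer matches."""
    return answer.strip().lower() in CHALLENGES.get(question, set())
```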
Typically, for a site with resources of any value to protect, you need a 3-pronged approach:
- Throttle responses from authenticated users only; disallow anonymous posts.
- Minimize (you can't fully prevent) the few trash posts from authenticated users - e.g. with a reputation-based system.
- Use server-side heuristic logic to identify spam-like behavior, or better, non-human-like behavior (sketched below).
Of course, a human moderator can also help, but then you have other problems - namely, flooding (or even drowning) the moderator, and some sites prefer the openness...
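By way of illustration, here's a rough sketch of the first and third prongs combined (every threshold here is an assumption to be tuned per site):

```python
# Per-user throttling plus simple server-side heuristics that flag
# non-human posting patterns.
import time
from collections import defaultdict, deque

POSTS_PER_MINUTE = 3                # assumption: tune per site
recent_posts = defaultdict(deque)   # user_id -> timestamps of recent posts
last_body = {}                      # user_id -> previous post body

def allow_post(user_id: str, body: str) -> bool:
    now = time.time()
    window = recent_posts[user_id]
    # Drop timestamps older than the one-minute window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= POSTS_PER_MINUTE:
        return False                # throttled: too many posts per minute
    if last_body.get(user_id) == body:
        return False                # heuristic: identical repeat post
    if window and now - window[-1] < 2:
        return False                # heuristic: faster than a human types
    window.append(now)
    last_body[user_id] = body
    return True
```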
Robots are actually not that hard to defeat. On one website I was involved with, we didn't even use a CAPTCHA - just a field labelled "Leave this field blank". Robots always failed that really simple test.
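Something like this, as a minimal sketch (the field name and markup are illustrative): the field is hidden from humans with CSS, so only scripts that blindly fill every input will trip it.

```python
# Honeypot check: render a field invisible to humans, then reject
# any submission that fills it in.
HONEYPOT_FIELD = "website"  # tempting name for a bot; real users never see it

FORM_SNIPPET = """
<div style="display:none" aria-hidden="true">
  <label>Leave this field blank</label>
  <input type="text" name="website" value="">
</div>
"""

def is_bot(form_data: dict) -> bool:
    """True if the hidden field came back non-empty."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```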
The bigger problem is mass human solving. There are lots of services whereby users solve screen-scraped CAPTCHAs in return for something, like videos or images (you know what I mean). Since a real human is solving the CAPTCHA, emotion-based, facial, and other more complex puzzles are meaningless.
Multi-step processes will discourage this behaviour, but at the cost of making things harder for genuine visitors, which is sad when we're all trying to design websites that are more usable.