I am attempting to build a system that only shows users a CAPTCHA when bot-like behavior is detected. Here are the behaviors that I am currently looking for when somebody is filling out a contact form...

  1. how quickly the form is submitted after the page loads (if it's 5 seconds or less, it's almost humanly impossible to have filled it out)

  2. how many contact attempts have been made in the past hour (limit 15/hour), or day (limit 25/day)

  3. check message content for links, and cross-check them against links included in other messages over the past day

  4. check message for spam keywords
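A minimal sketch of checks 1 to 3, written in Python for brevity (the logic ports directly to PHP). The function names, the idea of storing the page-load timestamp server-side (for example in the session), and the URL pattern are assumptions for illustration; only the 5-second, 15/hour and 25/day thresholds come from the list above.

```python
import re
import time

MIN_FILL_SECONDS = 5    # check 1: threshold from the list above
HOURLY_LIMIT = 15       # check 2: limits from the list above
DAILY_LIMIT = 25

# Hypothetical, deliberately loose URL pattern for check 3.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def too_fast(page_loaded_at, submitted_at, min_seconds=MIN_FILL_SECONDS):
    """True when the form came back in min_seconds or less."""
    return (submitted_at - page_loaded_at) <= min_seconds


def over_rate_limit(attempt_times, now):
    """True when earlier attempts already reach the hourly or daily limit."""
    last_hour = sum(1 for t in attempt_times if now - t < 3600)
    last_day = sum(1 for t in attempt_times if now - t < 86400)
    return last_hour >= HOURLY_LIMIT or last_day >= DAILY_LIMIT


def repeated_links(message, recent_links):
    """Return the links in message that also appear in recent_links."""
    return {url for url in URL_RE.findall(message) if url in recent_links}


def needs_captcha(page_loaded_at, attempt_times, message, recent_links, now=None):
    """Combine the three checks; any single hit triggers the CAPTCHA."""
    now = time.time() if now is None else now
    return (
        too_fast(page_loaded_at, now)
        or over_rate_limit(attempt_times, now)
        or bool(repeated_links(message, recent_links))
    )
```

The page-load time has to be recorded where the client cannot edit it; a hidden form field holding the timestamp could simply be rewritten by the bot.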


I will add useful community solutions here as they come:


What other behaviors would be indicative of robots that PHP could help detect (don't want to use JS because it can be switched off) without the help of a CAPTCHA?

+3  A: 

A very simple one (some more advanced bots won't fall for this, but many basic bots will) - put a bogus field in the form that isn't visible to a regular user (and as a backup, perhaps with a normally invisible label "don't type anything here"). If there's content in the field when submitted, chances are it's a bot.
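A minimal sketch of the server-side half of this check, in Python for brevity (the same test is a one-liner in PHP). The field name `website` is a hypothetical choice; any name that looks tempting to a bot would do, and the field itself is hidden from real users with CSS.

```python
def is_honeypot_tripped(form, trap_field="website"):
    """True when the hidden trap field came back with any content.

    `form` is the submitted form as a dict of field name -> string.
    A missing or whitespace-only trap field counts as untouched.
    """
    return bool(form.get(trap_field, "").strip())
```

For example, `is_honeypot_tripped({"name": "Ann", "website": ""})` is false, while a submission carrying `"website": "http://spam.test"` is flagged.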

Amber
Right, I've actually heard of that... a "honeypot". If anybody is interested, I'll edit my question and add a link.
johnnietheblack
+2  A: 

I believe you could use your robots.txt file as a signal: detect when it is requested and record the IP and timestamp of the requester. A normal user is very unlikely to ever look at your robots.txt file, so a form submission from an IP that recently fetched it is suspicious.

Most bots will check your robots.txt file (maybe for the directory structure, etc.).
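One way this could work, sketched in Python for brevity: serve robots.txt through a script (for example via a rewrite rule) so each request can be logged, then look the submitter's IP up at form time. The in-memory dict, the function names and the one-hour window are all assumptions; a real site would keep the hits in a database or cache.

```python
def record_robots_hit(hits, ip, now):
    """Remember when this IP last requested robots.txt."""
    hits[ip] = now


def fetched_robots_recently(hits, ip, now, window=3600):
    """True when this IP requested robots.txt within the last `window` seconds."""
    seen = hits.get(ip)
    return seen is not None and 0 <= now - seen <= window
```

A hit is only a hint, not proof: well-behaved crawlers fetch robots.txt too, and several people can share one IP.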

Jakub
I think he is looking to keep out the kind of bots that pose as humans, with shady or outright criminal intentions, and that don't check robots.txt.
Pekka
Are you able to do this? I legitimately don't know if this is possible or not, but if you can describe it, or provide a link on how, it's a good idea.
johnnietheblack
Some bots may check the robots.txt file to get the directory structure. I didn't mean this for Google/Yahoo/search engine crawlers.
Jakub
+1  A: 

An interesting factor could be typing frequency and mouse movements. They are fairly easy to catch via JavaScript. Analyzing them is a different matter, although I imagine it would be fairly easy to calculate deviations and averages that give a good idea how "organic" the movements are.

On the other hand, this is extremely expensive on the client side and can be understood as snooping / spying if detected. Maybe as advanced security for clients that are suspected to be bots?
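The "deviations and averages" part could look roughly like this, sketched in Python and assuming the page has already collected keystroke timestamps (in milliseconds) with JavaScript and posted them along with the form. The function names and the 0.1 threshold are assumptions, not tuned values.

```python
from statistics import mean, pstdev


def interval_stats(timestamps_ms):
    """Return (mean, standard deviation) of the gaps between keystrokes.

    Returns None when there are fewer than two gaps to compare.
    """
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if len(gaps) < 2:
        return None
    return mean(gaps), pstdev(gaps)


def looks_mechanical(timestamps_ms, min_cv=0.1):
    """True when typing rhythm is suspiciously regular.

    Uses the coefficient of variation (stdev / mean); human typing
    tends to be uneven, so a very low value suggests a script.
    """
    stats = interval_stats(timestamps_ms)
    if stats is None:
        return False          # not enough data to judge
    avg, dev = stats
    if avg <= 0:
        return True           # all keys at the same instant
    return dev / avg < min_cv
```

Perfectly even gaps such as `[0, 100, 200, 300, 400]` are flagged, while an uneven sequence such as `[0, 120, 310, 380, 650]` is not.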

Pekka
Not a bad idea, but if I'm a bot, I could easily just turn off JavaScript and bypass this, couldn't I?
johnnietheblack
Yes. This would only work when you have JavaScript as a must.
Pekka
Cool, I'll +1 for a cool idea... but I specifically need non-JS solutions (I want to make this as airtight as possible, and my sites are not JS-dependent). Thanks!
johnnietheblack
+1  A: 

Perhaps checking the referring URL? I can hardly imagine a lot of people ending up at a contact form without first going through several other pages of the website. The same goes for order forms and the like.
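A minimal sketch of such a check, in Python for brevity; `own_host` and the function name are assumptions. The Referer header is supplied by the client, so it can be forged or left out entirely, which makes this a weak signal to combine with others rather than a test on its own.

```python
from urllib.parse import urlparse


def referred_from_own_site(referer, own_host):
    """True when the Referer header points at a page on own_host.

    An empty or missing header returns False.
    """
    if not referer:
        return False
    host = urlparse(referer).hostname
    return host is not None and host.lower() == own_host.lower()
```

For example, `referred_from_own_site("https://example.com/about", "example.com")` is true, while an empty referer or one from another host is not.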

ChrisR
A: 

I added a hidden field (hidden by CSS, display:none) to the form with name="email". When it is filled in, it was a robot ;)

powtac
A: 

I'd suggest you forget trying to guess the signs... they are always changing.

I'd tokenize every imaginable 'feature' of the behaviour and automatically score each feature as 'ok', 'spam' or 'unsure'. Then 'Train on Error' (make a record of the cases where the guess was wrong). After a bit of time you could have 99.7 % accuracy.

Here is an example of the 7 most interesting features of a submission to my site that was scored at 89.9771 % spam. It is spam.

Each of these keywords found in the post is a feature that is about 99% likely to be spam:

mssg txt - "tours" || Prob 0.98993 
mssg txt - "cruises" || Prob 0.98993
mssg txt - "agencies" || Prob 0.98993
mssg txt - "choice" || Prob 0.98991 

The telephone number being '123456' is 95% likely to be spam

tel number - "123456" || Prob 0.95440 Delta 0.45440

The total length of the message being 30 characters (after html removed) is a feature that indicates 94% spam

mssg maxlen - "30" || Prob 0.94600 

(There was another feature that scored Prob 0.01011 which offset the total combined score knocking it down a bit. But, i am not gonna say what that feature was ;o)


It was submitted from a well-known spam IP: http://www.projecthoneypot.org/ip_84.19.186.171 but there was no need to use that particular knowledge to mark it out as spam. I gather all sorts of info, like IPs, submission rates etc... but, as you can see, the most glaring signs of bot-like behavior are not what you might guess.

To build your own one of these .... read this: http://www.paulgraham.com/spam.html
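For orientation, here is a sketch of the scoring step described in that essay, in Python: keep the features whose probability lies farthest from a neutral 0.5, then combine them with the essay's formula. The function names are assumptions, and this is not necessarily the scoring used for the example above, so it should not be expected to reproduce the 89.9771 % figure.

```python
from math import prod


def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from 0.5 (the essay uses 15)."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(probs):
    """Combine per-feature spam probabilities into one score.

    Computes prod(p) / (prod(p) + prod(1 - p)), treating the
    features as independent.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    if spam + ham == 0:
        return 0.5            # contradictory certainties: stay neutral
    return spam / (spam + ham)
```

Two features at 0.9 combine to roughly 0.988, while a 0.99 feature and a 0.01 feature cancel out to 0.5, which is how one strongly "ok" feature can pull a score down.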

JW