I know that the user agent is one indicator, but it's easy to spoof. What other reliable indicators are there that a visitor is really a bot? Inconsistent headers? Whether images/JavaScript are requested? Thanks!

+3  A: 

"Whether images/JavaScript are requested?" I would go for this one; however, Google and others request images and JavaScript files nowadays.

How about request rate? Bots read your content a lot faster than humans do.

Alix Axel
+1 - beat me to it
DarkSquid
A: 

Isn't that what CAPTCHA was invented for?

Janco
Trying to keep bots off your site is not a reason to make life harder for real users... CAPTCHAs are really a pain, even when they are not useful against bots...
Pascal MARTIN
+4  A: 

CVSTrac uses a honeypot page to accomplish this. It's a page linked somewhere on the site where crawlers will reach it but humans will usually ignore it. CVSTrac goes one step further by allowing the user to prove that they are human.
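The honeypot idea can be sketched roughly as follows. This is not CVSTrac's actual code, just a minimal illustration: the trap path, the `HoneypotTracker` class and its method names are all made up for the example, and a real site would hook this into its request handling and its own human-verification step.

```python
# Hypothetical trap URL: linked from the site in a way humans are unlikely to follow.
TRAP_PATH = "/trap/do-not-follow"


class HoneypotTracker:
    """Flags clients that request the trap page (illustrative sketch only)."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.flagged = set()  # client IPs that have hit the trap

    def observe(self, client_ip, path):
        """Record one request; return True if the client is flagged as a likely bot."""
        if path == self.trap_path:
            self.flagged.add(client_ip)
        return client_ip in self.flagged

    def clear(self, client_ip):
        """Unflag a client, e.g. after it has proven that it is human."""
        self.flagged.discard(client_ip)
```

Calling `observe(ip, path)` for every request keeps the flag sticky once the trap has been hit, and `clear(ip)` corresponds to the "prove that you are human" escape hatch mentioned above.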

Filip Navara
+2  A: 

There are 4 things that we look for:

  • The user agent string. It is very easy to fake, but often crawlers will use their own unique user agent string.

  • The rate at which pages are accessed. If a client requests more than one page every half second or so, that's usually a good indication.

  • If they request just the HTML, or if they request the entire page. Some crawlers will only ask for the HTML structure. This is usually a good tip off.

  • The incoming URL.
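As a rough sketch of the second point, the check could look something like this. The half-second interval comes from the list above; the 10-second window, the `RateDetector` name, and keying on client IP are assumptions made for the example, not a description of any particular production setup.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flags clients whose average request interval over a window is too short."""

    def __init__(self, window_seconds=10.0, min_interval=0.5):
        self.window = window_seconds
        # More requests than this within the window means the average
        # interval has dropped below min_interval.
        self.max_requests = int(window_seconds / min_interval)
        self.hits = defaultdict(deque)  # client IP -> recent request timestamps

    def observe(self, client_ip, timestamp):
        """Record a page request; return True if the client looks like a bot."""
        recent = self.hits[client_ip]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_requests
```

Because the rate is averaged over the whole window, a short burst such as opening a few links in new tabs within the same second does not trip the detector; only a sustained rate above one page per half second does.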

chollida
Re point 2: be aware that it is quite common (for me, at least) to follow several links from the same page within the same second (opening new tabs, obviously).
jensgram
@jensgram This is why we measure over several seconds and set the interval to half a second. We have found it to be an almost perfect indicator. I also open several links at a time from a webpage.
chollida
Also, I frequently disable image downloading through a web developer plugin when I am having connection issues and am only interested in reading text.
JYelton
A: 

Take a look at Bad Behavior, a library which employs a wide array of bot detection techniques.

Frank Farmer
+2  A: 

A reverse CAPTCHA of sorts can help as well: you could create a text input field with display: none; in its style attribute (or your stylesheet). Humans never see the field, so if a value is posted to it, chances are you're dealing with a bot.

Edit: This was actually something that had been aggregated in my RSS reader. If I can find the source, I'll link a good example.
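In the meantime, here is a minimal sketch of the server-side check. It assumes the form contains something like `<input type="text" name="website" style="display: none;">`; the field name `website` and the function `looks_like_bot` are made up for the example, and `form_data` stands for whatever dict-like object your framework gives you for the POST body.

```python
# Hypothetical name of the hidden field; real users never see or fill it.
HONEYPOT_FIELD = "website"


def looks_like_bot(form_data, honeypot_field=HONEYPOT_FIELD):
    """Return True if the hidden honeypot field was filled in."""
    value = form_data.get(honeypot_field, "")
    # Any non-blank value suggests an automated form filler.
    return bool(value.strip())
```

For example, `looks_like_bot({"name": "Ann", "website": ""})` is `False`, while a submission with `"website": "http://spam.example"` is flagged.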

Akoi Meexx