views: 790
answers: 12

Hi, I need to write some code to analyze whether or not a given user on our site is a bot. If it's a bot, we'll take some specific action. Looking at the user agent only works for friendly bots, since a bot can put whatever it wants in its user agent string. I'm after the behaviors of unfriendly bots. Various ideas I've had so far are:

  • If you don't have a browser ID
  • If you don't have a session ID
  • Unable to write a cookie

Obviously, there are some cases where a legitimate user will look like a bot, but that's OK. Are there other programmatic ways to detect a bot, or at least detect something that looks like one? Thanks!

+1  A: 

You say that it is okay if some legitimate users appear as bots, so:

Most bots don't run JavaScript. Use JavaScript to make an Ajax-style call to the server that identifies this IP address as NonBot. Store that for a set period of time to identify future connections from this IP as good clients and to prevent further wasteful JavaScript calls.
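A minimal client-side sketch of that idea; the /not-a-bot URL is just an assumed endpoint name that would record the caller's IP or session as human:

<script type="text/javascript">
// Only runs in clients that actually execute JavaScript; most crude bots never get here.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/not-a-bot', true);  // assumed endpoint that marks this IP/session as NonBot
xhr.send();
</script>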

Rob Prouse
+1  A: 

A simple test is javascript:

<script type="text/javascript">
document.write('<img src="/not-a-bot.' + 'php" style="display: none;">');
</script>

The not-a-bot.php script can set something in the session to flag that the user is not a bot, then return a single-pixel GIF.

The URL is broken up to disguise it from the bot.
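The answer describes a PHP script; just to make the endpoint concrete, here is the same idea sketched in Node/Express instead (my assumption, as are the session setup and flag name):

// Hypothetical Node/Express equivalent of not-a-bot.php (sketch only).
const express = require('express');
const session = require('express-session');

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

// A 1x1 transparent GIF, base64-encoded.
const PIXEL = Buffer.from('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

app.get('/not-a-bot.php', function (req, res) {
  req.session.notABot = true;          // flag the session as having executed JavaScript
  res.type('image/gif').send(PIXEL);   // return the single-pixel GIF the hidden <img> asked for
});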

Greg
The only difficulty is that lots of users now turn JavaScript off, given security concerns. It's almost humorous, since it would otherwise be one of the easiest ways to test for authenticity.
The Wicked Flea
Really? With JavaScript off, there are a ton of sites that just don't work nowadays. I thought more users were running with JavaScript ON as time progressed.
Zachary Yates
When using Firefox I have NoScript active most of the time, so going to a site with a setup like this would flag me as a bot from the get-go.
Dalin Seivewright
@Zachary, the 'problem' is that more and more good web developers are now using progressive enhancement to at least give a half-decent experience, so NoScript is apparently (although I've never tried it) a workable solution. I wish people weren't so paranoid; it makes so many otherwise easy things just frustratingly hard.
Simon_Weaver
People are paranoid for good reason. There are a lot of security vulnerabilities these days that start or propagate through javascript (XSRF being a huge one right now). If more web developers were progressive in their client-server interactions, the paranoia would be less likely (but still justified).
patridge
+5  A: 

Clarify why you want to exclude bots, and how tolerant you are of mis-classification.

That is, do you have to exclude every single bot at the expense of treating real users like bots? Or is it okay if bots crawl your site as long as they don't have a performance impact?

The only way to exclude all bots is to shut down your web site. A malicious user can distribute their bot to enough machines that you would not be able to distinguish their traffic from real users. Tricks like JavaScript and CSS will not stop a determined attacker.

If a "happy medium" is satisfactory, one trick that might be helpful is to hide links with CSS so that they are not visible to users in a browser, but are still in the HTML. Any agent that follows one of these "poison" links is a bot.
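A minimal sketch of the poison-link idea, assuming an Express-style server; the /trap path and the CSS class are made up for illustration:

// Markup served somewhere in the page, hidden from humans with CSS:
//   <style>.poison { display: none; }</style>
//   <a href="/trap" class="poison" rel="nofollow">archive</a>
const express = require('express');
const app = express();

const flaggedIps = new Set();          // IPs that followed the hidden link

app.get('/trap', function (req, res) {
  flaggedIps.add(req.ip);              // only an agent parsing raw HTML ends up here
  res.status(404).end();               // give it nothing useful back
});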

erickson
If the user had some sort of Web Accelerator installed, then it still might visit the invisible links, if the web accelerator wasn't extremely smart.
Kibbee
+2  A: 

User agents can be faked. Captchas have been cracked. Valid cookies can be sent back to your server with page requests. Legitimate programs, such as Adobe Acrobat Pro, can go in and download your web site in one session. Users can disable JavaScript. Since there is no standard measure of "normal" user behaviour, it cannot be differentiated from a bot's.

In other words: it can't be done, short of pulling the user into some form of interactive chat and hoping they pass the Turing test; then again, they could be a really good bot too.

Diodeus
A: 

Well, this is really for a particular page of the site. We don't want a bot submitting the form because it messes up our tracking. Honestly, the friendly bots (Google, Yahoo, etc.) aren't a problem, as they don't typically fill out the form to begin with. If we suspected someone of being a bot, we might show them a captcha image or something like that... If they passed, they're not a bot and the form submits...

I've heard of things like putting the form in Flash, or making the submit happen via JavaScript, but I'd prefer not to prevent real users from using the site unless I suspected they were a bot...

A: 

I think your idea of checking the session ID will already be quite useful.

Another idea: You could check whether embedded resources are downloaded as well.

A bot which does not load images (e.g. to save time and bandwidth) should be distinguishable from a browser which typically will load images embedded into a page.

Such a check, however, might not be suited to real-time use, because you would have to analyze some sort of server log, which might be time consuming.
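A rough offline sketch of that kind of log analysis, assuming a common/combined-format access log; the filename, field layout, and extension list are assumptions:

// Sketch: flag IPs that requested several pages but never any embedded resources.
const fs = require('fs');

const pagesByIp = new Map();    // IP -> number of page requests
const assetsByIp = new Map();   // IP -> number of image/CSS/JS requests

for (const line of fs.readFileSync('access.log', 'utf8').split('\n')) {
  const m = line.match(/^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)/);
  if (!m) continue;
  const ip = m[1], path = m[2];
  if (/\.(gif|png|jpe?g|css|js)(\?|$)/i.test(path)) {
    assetsByIp.set(ip, (assetsByIp.get(ip) || 0) + 1);
  } else {
    pagesByIp.set(ip, (pagesByIp.get(ip) || 0) + 1);
  }
}

for (const [ip, pages] of pagesByIp) {
  if (pages >= 5 && !assetsByIp.has(ip)) {
    console.log('possible bot:', ip);  // many pages, zero embedded resources
  }
}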

0xA3
IE and Firefox at least have the ability to not download images.
Dalin Seivewright
Safari also has the option to disable images.
epochwolf
Lynx. Don't forget Lynx. Which nobody uses. But which *can* submit forms. Yeah...
Brian
Yes, there is no perfect way. But I guess with a combination of several methods, such as checking for scripting, image downloads, CSS tricks, etc., you could make it much harder for an evil bot...
0xA3
A: 

For each session on the server you can determine if the user was at any point clicking or typing too fast. After a given number of repeats, set the "isRobot" flag to true and conserve resources within that session. Normally you don't tell the user that he's been robot-detected, since he'd just start a new session in that case.
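A minimal server-side sketch of that approach, assuming Express with sessions; the 500 ms gap and the repeat count are arbitrary assumptions:

// Sketch: flag a session as a robot after repeated inhumanly fast requests.
const express = require('express');
const session = require('express-session');

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

app.use(function (req, res, next) {
  const now = Date.now();
  const s = req.session;
  if (s.lastHit && now - s.lastHit < 500) {   // less than 500 ms between hits
    s.fastHits = (s.fastHits || 0) + 1;
    if (s.fastHits > 5) s.isRobot = true;     // after a few repeats, remember it quietly
  }
  s.lastHit = now;
  next();                                     // don't tell the client anything changed
});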

krosenvold
This wouldn't be foolproof, since many legitimate software solutions exist to automatically fill out web forms on a user's behalf.
sep332
Well, nothing's foolproof, but then again you just give a slightly lower QoS to that session. We'd only do this after a few pages of inhumanly fast behaviour.
krosenvold
A: 

Hey, thanks for all the responses. I think that a combination of a few suggestions will work well: mainly the hidden form element that times how fast the form was filled out, and possibly the "poison link" idea. I think that will cover most bases. When you're talking about bots, you're not going to find them all, so there's no point thinking that you will... Silly bots.
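A minimal sketch of the timed hidden form field; the field name and the server-side threshold are assumptions, and a bot that never runs JavaScript simply leaves the field empty, which is itself a signal:

<!-- hidden field inside the form; the name is made up for this sketch -->
<input type="hidden" id="form-started-at" name="form_started_at" value="">

<script type="text/javascript">
// Record when the form was rendered; on submit, the server flags submissions
// where (submit time - form_started_at) is under some minimum, e.g. 3 seconds.
document.getElementById('form-started-at').value = new Date().getTime();
</script>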

Well, "silly bots" except for Google - without which many sites wouldn't get any traffic at all :)
CraigD
A: 

This seems to be a really complicated problem.

Geshan
A: 

There is an API available at www.atlbl.com that can identify web crawlers (both good and bad) as well as other forms of automated web bots.

+1  A: 

Here's an idea:

Most bots don't download CSS, JavaScript, or images; they just parse the HTML.

If you keep track in a user's session of whether or not they download any of the above, e.g. by routing those download requests through a script that logs the attempts, then you can quickly identify users that only download the raw HTML (very few normal users will do this).
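A rough sketch of that routing idea, assuming Express with sessions and static assets served from an /assets path (both assumptions):

// Sketch: any session that fetches a static asset gets marked as having loaded resources.
const express = require('express');
const session = require('express-session');
const path = require('path');

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

app.use('/assets', function (req, res, next) {
  req.session.loadedAssets = true;   // real browsers hit this; HTML-only scrapers usually don't
  next();
}, express.static(path.join(__dirname, 'public')));

// Elsewhere, sessions with several page views and loadedAssets still unset
// can be treated as probable bots.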

Finbarr
A: 

Track mouse events? Bots don't have a mouse.
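A tiny client-side sketch of that idea; the /not-a-bot URL is borrowed from the earlier answers and is an assumption:

<script type="text/javascript">
// The first real mouse movement pings the server once; most bots never generate one.
var pinged = false;
document.onmousemove = function () {
  if (pinged) return;
  pinged = true;
  new Image().src = '/not-a-bot?via=mouse';  // assumed endpoint that flags the session
};
</script>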

r4ge