We log visits and visitors on page hits, and bots are clogging up our database. We can't use a CAPTCHA or similar techniques because this happens before we ever ask for human input: we are simply logging page hits, and we would like to log only hits made by humans.

Is there a list of known bot IPs out there? Does checking for known bot user-agents work?

A: 

I think many bots would be identifiable by their user-agent, but surely not all of them. As for a list of known IPs, I wouldn't count on that either.

A heuristic approach might work. Bots are usually much quicker at following links than people. Maybe you can track each client's IP and measure the average speed with which it follows links. A crawler will probably follow every link almost immediately, or at least much faster than a human would.
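
A rough sketch in Python of what I mean (all the names and thresholds here are made up for illustration, not taken from any framework):

    import time
    from collections import defaultdict

    last_hit = {}             # ip -> timestamp of the previous request
    gaps = defaultdict(list)  # ip -> recent inter-request gaps in seconds

    MIN_AVG_GAP = 2.0  # assumed threshold: humans rarely average under ~2s per page
    WINDOW = 10        # number of recent gaps to average over

    def looks_like_bot(ip):
        # Record the gap since this IP's last request and flag the client
        # if its average gap over the last WINDOW requests is too small.
        now = time.time()
        prev = last_hit.get(ip)
        last_hit[ip] = now
        if prev is None:
            return False  # first request from this IP; nothing to measure yet
        gaps[ip].append(now - prev)
        recent = gaps[ip][-WINDOW:]
        gaps[ip] = recent
        return len(recent) >= WINDOW and sum(recent) / len(recent) < MIN_AVG_GAP

You would call looks_like_bot(ip) on each page hit and skip logging when it returns True; the threshold would need tuning against your own traffic.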

Assaf Lavie
+3  A: 

Depending on the type of bot you want to detect:

RHSeeger
Are web crawlers categorized as bots?
berkay
A: 

I don't think there is a list of botnet IP addresses; botnet IPs are not static, and nobody knows which clients are bots, including users whose machines are behaving like bots.

Your question is arguably a hot research area right now; I'm curious whether someone can offer a solution to this problem.

You can use any such technique to decide whether a visitor is human, and only then write to the logs.

berkay
+1  A: 

Have you already added a robots.txt? While it won't stop malicious bots, you might be surprised at the amount of legitimate crawling activity already occurring on your site.
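
For reference, a minimal robots.txt might look like this (the disallowed path is just an example, and note that Crawl-delay is honored by some crawlers but not all):

    User-agent: *
    Disallow: /admin/
    Crawl-delay: 10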

cfeduke
+3  A: 

There is no sure-fire way to catch all bots; a bot can act just like a real browser if its author wants it to.

Most serious bots identify themselves clearly in the agent string, so with a list of known bots you can filter out most of them. To that list you can also add the agent strings that some HTTP libraries use by default, to catch bots from people who don't even know how to change the agent string. If you simply log the agent strings of your visitors, you should be able to pick out the ones to add to the list.
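
A minimal sketch of that kind of filter, assuming you maintain the list yourself (the fragments below are common real examples, but the list is illustrative, not exhaustive):

    KNOWN_BOT_FRAGMENTS = [
        "googlebot", "msnbot", "slurp", "yandex",  # major search-engine crawlers
        "curl", "wget", "python-urllib", "java/",  # default HTTP-library agent strings
        "libwww-perl",
    ]

    def is_known_bot(user_agent):
        # An empty agent string almost never comes from a real browser.
        if not user_agent:
            return True
        ua = user_agent.lower()
        return any(fragment in ua for fragment in KNOWN_BOT_FRAGMENTS)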

You can also build a "bad bot trap" by putting a hidden link on your page that leads to a URL which is disallowed in your robots.txt file. Serious bots won't follow the link, and humans can't click it, so only bots that ignore the rules will request the page.
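
A minimal sketch of such a trap (the /bot-trap/ path is hypothetical): disallow the path in robots.txt, link to it invisibly, and flag any client that requests it.

    # robots.txt -- compliant crawlers will never fetch this path
    User-agent: *
    Disallow: /bot-trap/

    <!-- In your page markup: a link no human can see or click -->
    <a href="/bot-trap/" style="display: none">trap</a>

On the server side, any IP that requests /bot-trap/ can be added to a blocklist before it reaches your normal hit logging.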

Guffa