views: 78
answers: 3

Lots of spiders/crawlers visit our news site. We depend on GeoIP services to identify our visitors' physical location and serve them related content, so we developed a module with a module_init() function that sends the visitor's IP to MaxMind and sets a cookie with the location information. To avoid sending a request on every page view, we first check whether the cookie is set; if it isn't, we query MaxMind and set the cookie. This works fine with regular clients, but not so well when a spider crawls through the site: the spider never keeps the cookie, so every page view triggers a new query to MaxMind, and that gets expensive. We are looking for a way to identify crawlers (or, if easier, legitimate browsers with cookies enabled) so that we query MaxMind only when it's useful.
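
For reference, a minimal sketch of the flow described above, assuming a PHP module; my_geoip_lookup(), the cookie name, and the lifetime are placeholders for whatever the real module uses, not the actual code:

    <?php
    // Sketch of the described flow (not the actual module code).
    // my_geoip_lookup() stands in for the existing MaxMind call; the cookie
    // name and lifetime are made up for illustration.
    function my_module_init() {
        if (isset($_COOKIE['geo_location'])) {
            return; // Location already resolved for this client; skip the lookup.
        }
        // A crawler that never returns the cookie falls through to this paid
        // call on every single page view.
        $location = my_geoip_lookup($_SERVER['REMOTE_ADDR']);
        setcookie('geo_location', $location, time() + 86400, '/');
    }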

+1  A: 

Spiders and crawlers usually send a distinctive User-Agent header; maybe you can filter on that?
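
A rough sketch of that idea in PHP; looks_like_crawler(), the cookie name, and the bot substrings are illustrative only, and in practice the list would be built from logged user agents:

    <?php
    // Return true if the user agent contains a common crawler signature.
    // The substrings below are examples; extend them from your own logs.
    function looks_like_crawler($userAgent) {
        $botSignatures = array('bot', 'crawl', 'spider', 'slurp');
        foreach ($botSignatures as $signature) {
            if (stripos($userAgent, $signature) !== false) {
                return true;
            }
        }
        return false;
    }

    // Only query the GeoIP service for clients that do not look like crawlers.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!isset($_COOKIE['geo_location']) && !looks_like_crawler($ua)) {
        // ... existing MaxMind lookup and setcookie() call go here ...
    }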

Leon
Thank you, you gave me an idea. I now write out a log of the user agent with each request; filtering out the obvious ones is easy. Thanks.
LymanZerga
+3  A: 

Honestly, there is no single thing to do here. I would suggest what I have done in the past to combat this same issue: use a browser-detection script (there are a ton of classes out there for detecting browsers) and check the result against a database of known browsers. If the browser is in your list, allow the call to the paid service; if not, fall back to a "best guess" script.

By that I mean something like this:

Generic IP lookup class

So if a browser type is not in your list, it won't use your paid service's DB; instead it uses this class, which gets as close as possible. That way you get the best of both worlds: bots are not racking up hits on your IP service, and if a user does slip past your browser check for some reason, they will most likely still get a correct location and the site will appear normal to them.

This is a little jumpy, I know; I just hope you get what I am trying to say here.

The real answer is that there is no easy or 100% right answer to this issue. I have built many sites with the same situation, gone half insane trying to figure it out, and this is as close to perfect as I have come. About 99% of legit crawlers will have a value like so:

$_SERVER['HTTP_USER_AGENT'] = 'Googlebot', 'Yammybot', 'Openbot', 'Yahoo'... etc.

A simple browser check will catch those; it's the shady ones that may report IE6 or something that cause trouble.

I really hope this helps. Like I said, there is no real answer here, at least none I have found to be 100%. It's kind of like detecting whether a user is on a handheld these days: you can get 99% of the way there but never 100%, and it always works out that the client falls into that 1% that doesn't work, lol.
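
A hedged sketch of that two-tier idea in PHP; is_known_browser(), my_maxmind_lookup(), and best_guess_location() are placeholder names (not real APIs), and the allow-list is purely illustrative:

    <?php
    // Two-tier lookup: known browsers use the paid MaxMind service,
    // everything else falls back to a cheap "best guess" lookup.
    function is_known_browser($userAgent) {
        // Tiny allow-list for illustration; a real check would use one of the
        // browser-detection classes mentioned above.
        $browsers = array('Firefox', 'Chrome', 'Safari', 'MSIE', 'Opera');
        foreach ($browsers as $name) {
            if (stripos($userAgent, $name) !== false) {
                return true;
            }
        }
        return false;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $ip = $_SERVER['REMOTE_ADDR'];
    if (is_known_browser($ua)) {
        $location = my_maxmind_lookup($ip);    // paid, accurate service
    } else {
        $location = best_guess_location($ip);  // free, approximate fallback
    }

As noted above, bots that spoof a browser user agent will still reach the paid lookup, so this only gets you most of the way there.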

BrandonS
Thank you. I am filtering legit crawlers by writing the user agents out to a log and adding conditions for them. Low tech, but it works. Thanks!
LymanZerga
Great, glad I could help.
BrandonS
A: 

Detecting web crawlers (both legitimate and nefarious) can be done with the ATL webcrawler API at www.atlbl.com.

Mark