views: 845
answers: 5
I have been noticing on my trackers that bots are visiting my site a lot. Should I change or edit my robots.txt, or change something else? I'm not sure whether that's good; are they indexing my site, or what?

+3  A: 

Should I change or edit my robots.txt, or change something?

Depends on the bot. Some bots simply ignore robots.txt. We had a similar problem 18 months ago with the Google ad bot, because our customer was purchasing so many ads. Google ad bots will (as documented) ignore wildcard (*) exclusions, but obey explicit exclusions.
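For example (a sketch; Mediapartners-Google is the documented name of Google's AdSense crawler, and /private/ is just a placeholder path), a wildcard group alone won't stop the ad bot, but naming it explicitly will:

  # Ignored by Google's ad bot:
  User-agent: *
  Disallow: /private/

  # Honored, because the bot is named explicitly:
  User-agent: Mediapartners-Google
  Disallow: /private/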

Remember, bots that honor robots.txt will just not crawl your site. This is undesirable if you want them to get access to your data for indexing.

A better solution is to throttle or supply static content to the bots.

Not sure if that's good, because they are indexing or what?

They could be indexing, scraping, or stealing; it's all much the same, really. What I think you want is to throttle their HTTP request processing based on User-Agent. How to do this depends on your web server and app container.

As suggested in other answers, if the bot is malicious, then you'll need to find its User-Agent pattern and send it 403 Forbidden responses. Or, if the malicious bots dynamically change their user-agent strings, you have two further options:

  • White-list User-Agents - e.g. create a user-agent filter that only accepts certain user agents. This is very imperfect.
  • IP banning - the source IP comes from the TCP connection (or from an X-Forwarded-For header if you're behind a proxy). And if you're getting DoS'd (denial-of-service attack), then you have bigger problems.
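In Apache, both approaches can be sketched in .htaccess (the bot name "EvilScraper" and the IP address below are made-up placeholders; syntax is Apache 2.2 style):

  # 403 any request whose User-Agent contains the (hypothetical) token "EvilScraper"
  SetEnvIfNoCase User-Agent EvilScraper badbot=yes
  Deny from env=badbot

  # Ban one specific offending source address (placeholder IP)
  Deny from 203.0.113.42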
CVertex
Gotta disagree with item 1: Google obeys robots.txt very well.
Unkwntech
Not true when we had this problem 18 months ago (with www.mytickets.com.au). It was an ad bot from Google that was constantly checking for new resources. I'll check my source on this again.
CVertex
You are right. The case I was thinking of was this: Google ad bots ignore the wildcard case (*).
CVertex
I wouldn't count on a simplistic throttling by UA. I've seen "bad" bots rotate UAs every few requests.
Eli
+3  A: 

I really don't think changing robots.txt is going to help, because only good bots abide by it. All others ignore it and parse your content as they please. Personally, I use http://www.codeplex.com/urlrewriter to get rid of undesirable robots by responding with a Forbidden message when they are detected.

Nick Berardi
+3  A: 

The spam bots don't care about robots.txt. You can block them with something like mod_security (which is a pretty cool Apache module in its own right). Or you could just ignore them.
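A minimal rule for that might look like this (ModSecurity 2.x syntax; the User-Agent substring "EmailSiphon" and the rule id are placeholders):

  # Return 403 to any request whose User-Agent contains "EmailSiphon" (placeholder token)
  SecRule REQUEST_HEADERS:User-Agent "@contains EmailSiphon" "id:900001,phase:1,deny,status:403"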

Eli
+2  A: 

You might have to use .htaccess to deny some bots before they screw up your logs. See here: http://spamhuntress.com/2006/02/13/another-hungry-java-bot/

I had lots of Java bots crawling my site, adding

# Tag any request whose User-Agent starts with "Java/1." or "Java1."
SetEnvIfNoCase User-Agent ^Java/1\. javabot=yes
SetEnvIfNoCase User-Agent ^Java1\. javabot=yes
# Deny (403) every request carrying that environment variable
Deny from env=javabot

made them stop. Now they only get 403 one time and that's it :)

Patrick Geiller
+2  A: 

I once worked for a customer who had a number of "price comparison" bots hitting the site all of the time. The problem was that our backend resources were scarce and cost money per transaction.

We tried to fight some of these off for a while, but the bots just kept changing their recognizable characteristics. We ended up with the following strategy:

For each session on the server we determined if the user was at any point clicking too fast. After a given number of repeats, we'd set the "isRobot" flag to true and simply throttle down the response speed within that session by adding sleeps. We did not tell the user in any way, since he'd just start a new session in that case.
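A sketch of that idea in Python (the threshold, strike count, and delay values are invented; the plain dict stands in for whatever per-session store your framework provides; the sleep function is injectable so the heuristic can be exercised without real delays):

```python
import time

FAST_CLICK_SECONDS = 0.5   # clicks closer together than this look automated (invented threshold)
STRIKES_BEFORE_FLAG = 5    # how many fast clicks before we flag the session (invented)
ROBOT_SLEEP_SECONDS = 3.0  # artificial delay added for flagged sessions (invented)

def handle_request(session, now=None, sleep=time.sleep):
    """Update per-session bot heuristics; return True if this request was throttled."""
    now = time.monotonic() if now is None else now
    last = session.get("last_request")
    session["last_request"] = now
    # Count a "strike" whenever two requests arrive suspiciously close together
    if last is not None and now - last < FAST_CLICK_SECONDS:
        session["strikes"] = session.get("strikes", 0) + 1
    if session.get("strikes", 0) >= STRIKES_BEFORE_FLAG:
        session["isRobot"] = True
    if session.get("isRobot"):
        sleep(ROBOT_SLEEP_SECONDS)  # silently slow the response; never tell the client
        return True
    return False
```

In production the default time.sleep applies; a normal-speed visitor never accumulates strikes and is never delayed.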

krosenvold