Due to some rather bizarre architectural considerations, I've had to set up something that really ought to run as a console application as a web page. It does the job of writing a large variety of text files and XML feeds from our site data for various other services to pick up, so obviously it takes a little while to run and is pretty processor-intensive.

However, before I deploy it, I'm rather worried that it might get hit repeatedly by spiders and the like. It's fine for the data to be re-written, but continual hits on this page are going to cause performance issues, for obvious reasons.

Is this something I ought to worry about? Or in reality is spider traffic unlikely to be intensive enough to cause problems?

A: 

You should require authentication for the page.

Even if you exclude it in robots.txt, there's no guarantee that spiders will respect that. If it's an expensive page that might impact site availability, stick it behind an authentication gateway.

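A minimal sketch of that idea, assuming a Flask-style handler in Python (the framework, the /export-feeds path and the credentials are purely illustrative; the question doesn't say what the page runs on):

    # Require HTTP Basic credentials before doing the expensive work;
    # spiders won't present any, so they get a 401 and move on.
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/export-feeds")
    def export_feeds():
        auth = request.authorization
        if not auth or auth.username != "exporter" or auth.password != "secret":
            return Response("Authentication required", 401,
                            {"WWW-Authenticate": 'Basic realm="export"'})
        # ... run the expensive file/feed generation here ...
        return "Export complete", 200
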
jemfinch
+1  A: 

You can tell the big ones not to spider you via robots.txt; see www.robotstxt.org.

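For example, a robots.txt at the site root along these lines (the /export-feeds path is a hypothetical stand-in for your page) tells well-behaved crawlers to stay away:

    User-agent: *
    Disallow: /export-feeds
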
You could also implement some form of authentication or IP-address restriction that would prevent it from running.

Alex K.
+1  A: 

You might be surprised how many spiders there are out there.

You should use robots.txt to exclude them.

If you're worried that spiders might ignore robots.txt (and some inevitably will), how about requiring a POST rather than a GET to trigger the script? That should exclude all spiders.

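As a sketch of the POST-only approach, again assuming a Flask-style handler in Python (the framework and route name are illustrative, not from the question):

    from flask import Flask

    app = Flask(__name__)

    # Registering only POST means Flask answers GET requests (which is all
    # a spider will ever send) with 405 Method Not Allowed automatically.
    @app.route("/export-feeds", methods=["POST"])
    def export_feeds():
        # ... run the expensive file/feed generation here ...
        return "Export complete", 200
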
RichieHindle