views: 102
answers: 6
What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?

+3  A: 
  • Captchas
  • Form submitted in less than a second
  • Hidden (via CSS) honeypot field that gets a value when the form is submitted (sketched below)
  • Frequent page visits

Simple bots cannot scrape text from Flash, images, or audio.
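
A minimal sketch of the honeypot-field and submit-timing checks from the list above, assuming a plain dict of posted form fields. The names (check functions, `MIN_SUBMIT_SECONDS`, `website_url`) and thresholds are hypothetical, not from the answer:

```python
import time

MIN_SUBMIT_SECONDS = 2          # assumed threshold: humans rarely submit this fast
HONEYPOT_FIELD = "website_url"  # hidden via CSS; real users never fill it in

def render_form():
    """Return form HTML carrying the time it was served (spoofable, but catches naive bots)."""
    served_at = time.time()
    return (
        '<form method="post">'
        '  <input type="text" name="email">'
        # Honeypot: invisible to humans, but naive bots fill every field they find.
        f'  <input type="text" name="{HONEYPOT_FIELD}" style="display:none">'
        f'  <input type="hidden" name="served_at" value="{served_at}">'
        '  <button type="submit">Send</button>'
        '</form>'
    )

def looks_like_bot(form):
    """form: dict of posted fields. True if the submission looks automated."""
    # 1. The hidden honeypot field got a value -> almost certainly a bot.
    if form.get(HONEYPOT_FIELD):
        return True
    # 2. Form came back implausibly fast after being served (or the timestamp is missing/mangled).
    served_at = form.get("served_at")
    if not served_at:
        return True
    try:
        elapsed = time.time() - float(served_at)
    except ValueError:
        return True
    return elapsed < MIN_SUBMIT_SECONDS
```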

Hasan Khan
All of those options (whilst valid) could also block legitimate crawlers such as Google, badly affecting your page rank, and captchas would get in the way of normal users. It also doesn't answer the question of how you could tell that your site is being accessed by a bot.
Paul Hadfield
+1  A: 

You can use robots.txt to block bots that take notice of it (while still letting through known crawlers such as Google) - but that won't stop the ones that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted, you could block particular user agents from accessing your website, simply by returning an empty/default page and/or a particular status code.
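
A rough sketch of the user-agent blocking idea, assuming you have the User-Agent string from the request or your logs. The fragment lists and the `should_block` helper are made up for illustration; real scrapers can spoof this header, as other answers note:

```python
BLOCKED_AGENT_FRAGMENTS = ["curl", "wget", "python-requests", "scrapy"]  # assumed blocklist
ALLOWED_BOTS = ["googlebot", "bingbot"]  # known crawlers you still want to let through

def should_block(user_agent):
    """Decide from the User-Agent header whether to serve an empty/default page instead."""
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        return False
    return any(fragment in ua for fragment in BLOCKED_AGENT_FRAGMENTS)

# Example, inside whatever request handler your framework gives you:
# if should_block(request.headers.get("User-Agent")):
#     return "", 403   # empty body and a particular status code
```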

Paul Hadfield
A: 

I don't think there is a way of doing exactly what you need, because crawlers/scrapers can set any request headers they like, including User-Agent, so you won't be able to tell whether a page was requested by a user running Mozilla Firefox or by a scraper/crawler...

tsocks
A: 

Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
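
One way this could look in practice, sketched here as a per-request rewrite of known CSS class names to random aliases; the `randomize_classes` helper and the class list are hypothetical, and a real implementation would also emit matching styles and restrict the rewrite to class attributes:

```python
import re
import secrets

def randomize_classes(html, class_names):
    """Replace known CSS class names with per-request random aliases so scrapers
    cannot rely on stable selectors. Returns the altered HTML and the alias map
    (which you would also use when emitting the per-request stylesheet)."""
    aliases = {name: f"c{secrets.token_hex(4)}" for name in class_names}
    for original, alias in aliases.items():
        html = re.sub(rf"\b{re.escape(original)}\b", alias, html)
    return html, aliases

# Example:
# page = '<div class="price">19.99</div><span class="title">Widget</span>'
# altered, mapping = randomize_classes(page, ["price", "title"])
```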

Weston C
+1  A: 

Unfortunately your question is similar to asking how to block spam: there's no fixed answer, and nothing will stop a person or bot that is persistent.

However, here are some methods that can be implemented:

  1. Check the User-Agent header (though this can be spoofed).
  2. Use robots.txt (well-behaved bots will, hopefully, respect it).
  3. Detect IP addresses that access a lot of pages too regularly (every "x" seconds) - see the sketch after this list.
  4. Manually, or via flags in your system, review who is accessing your site and block the routes the scrapers take.
  5. Don't use a standard template on your site, create generic CSS class names, and don't put HTML comments in your code.
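
A small sketch of point 3, keeping a sliding window of recent hits per IP; the window size, threshold, and `record_and_check` helper are assumptions, and an in-memory dict like this would need replacing with shared storage on a multi-process server:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed: look at the last minute of requests
MAX_REQUESTS = 30     # assumed: more page hits than this per window looks automated

_requests_by_ip = defaultdict(deque)

def record_and_check(ip):
    """Record a page hit for `ip` and return True if its request rate looks bot-like."""
    now = time.time()
    hits = _requests_by_ip[ip]
    hits.append(now)
    # Drop hits that have fallen out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```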
Duniyadnd
What is the reason for not using HTML comments in the code?
mt3
Comments make it easy for the scraper to break the template into sections, even if you change the layout of your code a little bit.
Duniyadnd