What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?
- Captchas
- The form is submitted less than a second after the page was served
- A hidden (via CSS) honeypot field comes back with a value when the form is submitted (both checks are sketched below)
- Frequent page visits from the same address
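As a rough illustration of the honeypot-field and form-timing checks above, here is a minimal Flask sketch. The route names, field names, and the one-second threshold are assumptions chosen for illustration, not a definitive implementation:

```python
import time

from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical form: the honeypot field "website" is hidden with CSS,
# and "ts" carries the time at which the page was rendered.
FORM = """
<form method="post" action="/submit">
  <input type="hidden" name="ts" value="{{ ts }}">
  <input type="text" name="website" style="display:none" autocomplete="off">
  <input type="text" name="email">
  <button type="submit">Send</button>
</form>
"""

@app.route("/contact")
def contact():
    # Embed the render time so we can measure how quickly the form comes back
    return render_template_string(FORM, ts=time.time())

@app.route("/submit", methods=["POST"])
def submit():
    # Honeypot: a human never sees this field, so it should come back empty
    if request.form.get("website"):
        return "", 403
    # Timing: a form submitted in under a second was almost certainly
    # not filled in by a person
    try:
        elapsed = time.time() - float(request.form["ts"])
    except (KeyError, ValueError):
        return "", 403
    if elapsed < 1.0:
        return "", 403
    return "Thanks!"
```

Note that the timestamp is forgeable as written; in practice you would sign it server-side so a bot cannot simply replay an old value.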
Simple bots cannot scrape text out of Flash, images, or audio.
You can use robots.txt to block bots that honour it (while still allowing known crawlers such as Google through), but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted, you could block particular user agents from accessing your website by returning an empty/default page and/or a particular status code.
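A minimal sketch of that user-agent blocking idea, again in Flask. The blocklist entries here are just examples; in practice you would build the list from the user agents you actually see in your own logs:

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Example substrings only, not a definitive list
BLOCKED_AGENTS = ("curl", "python-requests", "scrapy")

@app.before_request
def block_known_scrapers():
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(marker in ua for marker in BLOCKED_AGENTS):
        # Return a particular status code (or an empty/default page)
        # instead of the real content
        abort(403)

@app.route("/")
def index():
    return "Hello, human!"
```

As the next answer points out, though, the User-Agent header is trivially spoofed, so treat this as a filter for lazy bots only.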
I don't think there is a way of doing exactly what you need, because crawlers/scrapers can edit every header when requesting a page, including User-Agent, so you won't be able to tell whether the request came from a user on Mozilla Firefox or from a scraper/crawler...
Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
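One way to serve altered markup per request is to append a fresh random suffix to CSS class names, so selectors that worked on the last page load fail on the next. A sketch, assuming the rewriting happens as a post-processing step over your rendered HTML:

```python
import re
import secrets

def randomize_class_names(html: str) -> str:
    """Append a fresh random suffix to every class name so CSS
    selectors differ from one request to the next."""
    suffix = secrets.token_hex(4)

    def rewrite(match: re.Match) -> str:
        names = match.group(1).split()
        return 'class="' + " ".join(f"{n}-{suffix}" for n in names) + '"'

    return re.sub(r'class="([^"]*)"', rewrite, html)

print(randomize_class_names('<span class="price">9.99</span>'))
# e.g. <span class="price-a41f02c7">9.99</span>
```

The catch is that your stylesheet has to be rewritten with the same suffix, which is why this usually lives in a templating or build step; and a determined scraper can still match on text content or document structure rather than class names.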
Unfortunately your question is similar to people asking how to block spam: there is no fixed answer, and nothing will stop a persistent person or bot.
However, here are some methods that can be implemented:
- Check the User-Agent (though this can be spoofed)
- Use robots.txt (well-behaved bots will hopefully respect it)
- Detect IP addresses that access a lot of pages too consistently (every "x" seconds); see the sketch after this list
- Manually, or via flags in your system, review who is visiting your site and block the routes the scrapers take
- Don't use a standard template on your site, use generic CSS class names, and don't put HTML comments in your code
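Here is a sketch of the "too-consistent access" check from the list above. The window size, request cap, and 0.1-second jitter threshold are arbitrary illustrations you would tune for your own traffic:

```python
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds of history kept per IP (arbitrary choice)
MAX_REQUESTS = 30  # more than this per window looks automated (arbitrary)

hits = defaultdict(deque)

def looks_like_bot(ip: str) -> bool:
    """Flag IPs that fetch pages too often, or at suspiciously
    regular intervals (every "x" seconds)."""
    now = time.time()
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        return True
    if len(q) >= 5:
        ts = list(q)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        # Nearly identical gaps between requests suggest a timer-driven crawler
        if max(gaps) - min(gaps) < 0.1:
            return True
    return False
```

You would call `looks_like_bot(request_ip)` on each request and block or throttle when it returns True; for anything beyond a single process, the per-IP state would live in something shared like Redis rather than an in-memory dict.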