What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?
- Captchas
- The form is submitted less than a second after the page was served
- A hidden (via CSS) honeypot field comes back with a value when the form is submitted (both checks are sketched below)
- Frequent page visits from the same address
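As a rough illustration of the honeypot-field and form-timing checks above, here is a minimal Flask sketch. The route names, field names, and the one-second threshold are assumptions chosen for illustration, not a definitive implementation:

```python
import time

from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical form: the honeypot field "website" is hidden with CSS,
# and "ts" carries the time at which the page was rendered.
FORM = """
<form method="post" action="/submit">
  <input type="hidden" name="ts" value="{{ ts }}">
  <input type="text" name="website" style="display:none" autocomplete="off">
  <input type="text" name="email">
  <button type="submit">Send</button>
</form>
"""

@app.route("/contact")
def contact():
    # Embed the render time so we can measure how quickly the form comes back
    return render_template_string(FORM, ts=time.time())

@app.route("/submit", methods=["POST"])
def submit():
    # Honeypot: a human never sees this field, so it should come back empty
    if request.form.get("website"):
        return "", 403
    # Timing: a form submitted in under a second was almost certainly
    # not filled in by a person
    try:
        elapsed = time.time() - float(request.form["ts"])
    except (KeyError, ValueError):
        return "", 403
    if elapsed < 1.0:
        return "", 403
    return "Thanks!"
```

Note that the timestamp is forgeable as written; in practice you would sign it server-side so a bot cannot simply replay an old value.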
Simple bots cannot scrape text out of Flash, images, or audio.
You can use robots.txt to block bots that honour it (while still allowing known crawlers such as Google through), but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted, you could block particular user agents from accessing your website by returning an empty/default page and/or a particular status code.
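A minimal sketch of that user-agent blocking idea, again in Flask. The blocklist entries here are just examples; in practice you would build the list from the user agents you actually see in your own logs:

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Example substrings only, not a definitive list
BLOCKED_AGENTS = ("curl", "python-requests", "scrapy")

@app.before_request
def block_known_scrapers():
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(marker in ua for marker in BLOCKED_AGENTS):
        # Return a particular status code (or an empty/default page)
        # instead of the real content
        abort(403)

@app.route("/")
def index():
    return "Hello, human!"
```

As the next answer points out, though, the User-Agent header is trivially spoofed, so treat this as a filter for lazy bots only.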
I don't think there is a way of doing exactly what you need, because crawlers/scrapers can edit every header when requesting a page, including User-Agent, so you won't be able to tell whether the request came from a user on Mozilla Firefox or from a scraper/crawler...
Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
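One way to serve altered markup per request is to append a fresh random suffix to CSS class names, so selectors that worked on the last page load fail on the next. A sketch, assuming the rewriting happens as a post-processing step over your rendered HTML:

```python
import re
import secrets

def randomize_class_names(html: str) -> str:
    """Append a fresh random suffix to every class name so CSS
    selectors differ from one request to the next."""
    suffix = secrets.token_hex(4)

    def rewrite(match: re.Match) -> str:
        names = match.group(1).split()
        return 'class="' + " ".join(f"{n}-{suffix}" for n in names) + '"'

    return re.sub(r'class="([^"]*)"', rewrite, html)

print(randomize_class_names('<span class="price">9.99</span>'))
# e.g. <span class="price-a41f02c7">9.99</span>
```

The catch is that your stylesheet has to be rewritten with the same suffix, which is why this usually lives in a templating or build step; and a determined scraper can still match on text content or document structure rather than class names.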
Unfortunately your question is similar to people asking how to block spam: there is no fixed answer, and nothing will stop a persistent person or bot.
However, here are some methods that can be implemented:
- Check the User-Agent (though this can be spoofed)
- Use robots.txt (well-behaved bots will hopefully respect it)
- Detect IP addresses that access a lot of pages too consistently (every "x" seconds); see the sketch after this list
- Manually, or via flags in your system, review who is visiting your site and block the routes the scrapers take
- Don't use a standard template on your site, use generic CSS class names, and don't put HTML comments in your code
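Here is a sketch of the "too-consistent access" check from the list above. The window size, request cap, and 0.1-second jitter threshold are arbitrary illustrations you would tune for your own traffic:

```python
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds of history kept per IP (arbitrary choice)
MAX_REQUESTS = 30  # more than this per window looks automated (arbitrary)

hits = defaultdict(deque)

def looks_like_bot(ip: str) -> bool:
    """Flag IPs that fetch pages too often, or at suspiciously
    regular intervals (every "x" seconds)."""
    now = time.time()
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        return True
    if len(q) >= 5:
        ts = list(q)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        # Nearly identical gaps between requests suggest a timer-driven crawler
        if max(gaps) - min(gaps) < 0.1:
            return True
    return False
```

You would call `looks_like_bot(request_ip)` on each request and block or throttle when it returns True; for anything beyond a single process, the per-IP state would live in something shared like Redis rather than an in-memory dict.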