I want to prevent automated html scraping from one of our sites while not affecting legitimate spidering (googlebot, etc.). Is there something that already exists to accomplish this? Am I even using the correct terminology?

EDIT: I'm mainly looking to prevent people who would be doing this maliciously, i.e. they aren't going to abide by robots.txt.

EDIT2: What about preventing abuse by rate of use, i.e. presenting a captcha to continue browsing if automation is detected and the traffic isn't from a legitimate (Google, Yahoo, MSN, etc.) IP?

+5  A: 

This is difficult if not impossible to accomplish. Many "rogue" spiders/crawlers do not identify themselves via the user agent string, so it is difficult to identify them. You can try to block them via their IP address, but it is difficult to keep up with adding new IP addresses to your block list. It is also possible to block legitimate users if IP addresses are used since proxies make many different clients appear as a single IP address.

The problem with using robots.txt in this situation is that the spider can just choose to ignore it.

EDIT: Rate limiting is a possibility, but it suffers from some of the same problems of identifying (and keeping track of) "good" and "bad" user agents/IPs. In a system we wrote to do some internal page view/session counting, we eliminate sessions based on page view rate, but we also don't worry about eliminating "good" spiders since we don't want them counted in the data either. We don't do anything about preventing any client from actually viewing the pages.
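Not the actual counting system described above, but a minimal sketch of a page-view-rate check of that kind, assuming an in-memory per-IP counter (the class name, threshold, and window are illustrative):

```csharp
using System;
using System.Collections.Concurrent;

// Rough sketch of a per-IP page view rate check; the threshold and window
// below are invented for illustration, not taken from any real system.
public class RequestRateTracker
{
    private const int MaxRequestsPerWindow = 60;                    // assumed threshold
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);

    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _counts =
        new ConcurrentDictionary<string, (DateTime WindowStart, int Count)>();

    // Returns true once the client exceeds the allowed page view rate in the current window.
    public bool IsOverLimit(string clientIp)
    {
        DateTime now = DateTime.UtcNow;
        var entry = _counts.AddOrUpdate(
            clientIp,
            _ => (now, 1),                                          // first request from this IP
            (_, e) => now - e.WindowStart > Window
                ? (now, 1)                                          // window expired: start over
                : (e.WindowStart, e.Count + 1));                    // still in window: count it
        return entry.Count > MaxRequestsPerWindow;
    }
}
```

As the answer notes, any check like this still needs a whitelist of known-good spiders so they aren't counted or blocked along with the bad ones.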

Sean Carpenter
+1 robots.txt will not get the job done if the spider is malicious. You will need to block them at the firewall by IP or user agent string, but unfortunately (as you noted) this can be quite difficult to keep up with.
Andrew Hare
It would be best to create an HttpModule to filter out the malicious scrapers based on request rates, IPs, whatever.
Todd
If you use an HttpModule then you are opening yourself up to a possible DoS attack.
Andrew Hare
You could also blacklist spiders who don't honor the robots.txt, but this would require a decent amount of coding.
skirmish
How does using an HttpModule open you up to a DoS attack? Here's an article that says modules are the best way to *prevent* them: http://msmvps.com/blogs/omar/archive/2007/03/24/prevent-denial-of-service-dos-attacks-in-your-web-application.aspx
Todd
And here's another one: http://www.webpronews.com/expertarticles/2007/01/19/aspnet-easily-block-dos-attacks
Todd
+1  A: 

robots.txt only works if the spider honors it. You can create an HttpModule to filter out spiders that you don't want crawling your site.
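A bare-bones sketch of such a module (classic ASP.NET / System.Web); the user agent blocklist is a placeholder, and a real filter would combine this with rate or IP checks:

```csharp
using System;
using System.Web;

// Minimal example of an IHttpModule that rejects requests from unwanted user agents.
// The substrings below are placeholders; a real list would be maintained elsewhere.
public class SpiderFilterModule : IHttpModule
{
    private static readonly string[] BlockedAgents = { "BadBot", "EvilScraper" };

    public void Init(HttpApplication context)
    {
        context.BeginRequest += OnBeginRequest;
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        var app = (HttpApplication)sender;
        string userAgent = app.Request.UserAgent ?? string.Empty;

        foreach (var blocked in BlockedAgents)
        {
            if (userAgent.IndexOf(blocked, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                app.Response.StatusCode = 403;
                app.CompleteRequest(); // short-circuit the rest of the pipeline
                return;
            }
        }
    }

    public void Dispose() { }
}
```

The module would then be registered in web.config (under httpModules, or modules for the IIS7 integrated pipeline) so it runs for every request.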

Todd
Agreed. So long as you can identify good spiders, e.g. by their user agent, you don't need to worry about how to identify bad ones. If it's requesting too often and isn't a good spider, then filter it out.
Ben Daniel
Ignoring robots.txt is exactly what reveals the "rogue" spider; see my answer about the honeypot.
Constantin
A: 

You should do what good firewalls do when they detect malicious use: let them keep going, but don't give them anything useful. If you start throwing 403s or 404s, they'll know something is wrong. If you return random data, they'll go about their business.

For detecting malicious use, try adding a trap link on your search results page (or whatever page they are using as your site map) and hiding it with CSS. You do need to check whether the client claims to be a valid bot and let those through, though. You can store the offending IP for future use and a quick ARIN WHOIS lookup.
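A rough sketch of that idea: hide a link from humans with CSS, flag any IP that requests it, and afterwards keep serving that IP plausible-looking junk instead of an error. The "/trap-link" path, the markup, and the junk generator are all invented for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Web;

// Sketch only. The hidden anchor would be emitted on the search results page, e.g.:
//   <a href="/trap-link" style="display:none">more results</a>
public class TrapLinkModule : IHttpModule
{
    private static readonly ConcurrentDictionary<string, DateTime> FlaggedIps =
        new ConcurrentDictionary<string, DateTime>();

    public void Init(HttpApplication context)
    {
        context.BeginRequest += OnBeginRequest;
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        var app = (HttpApplication)sender;
        string ip = app.Request.UserHostAddress;

        // Anyone who follows the hidden link gets flagged. A real implementation would
        // first check for (and let through) verified search engine bots, as noted above.
        if (app.Request.Path.Equals("/trap-link", StringComparison.OrdinalIgnoreCase))
        {
            FlaggedIps[ip] = DateTime.UtcNow;
        }

        // Flagged clients keep getting 200 responses, just with meaningless content,
        // so nothing tips them off that they have been detected.
        if (FlaggedIps.ContainsKey(ip))
        {
            var rng = new Random();
            var junk = new StringBuilder("<html><body>");
            for (int i = 0; i < 50; i++)
                junk.Append("<p>item ").Append(rng.Next(100000)).Append("</p>");
            junk.Append("</body></html>");

            app.Response.StatusCode = 200;
            app.Response.ContentType = "text/html";
            app.Response.Write(junk.ToString());
            app.CompleteRequest();
        }
    }

    public void Dispose() { }
}
```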

DavGarcia
+3  A: 

One approach is to set up an HTTP tar pit: embed a link that will only be visible to automated crawlers. The link should go to a page stuffed with random text and links to itself (but with varying page names: /tarpit/foo.html, /tarpit/bar.html, /tarpit/baz.html), and the script at /tarpit/ should handle all of these requests with a 200 result.

To keep the good guys out of the pit, generate a 302 redirect to your home page if the user agent is google or yahoo.

It isn't perfect, but it will at least slow down the naive ones.

EDIT: As suggested by Constantin, you could mark the tar pit as off-limits in robots.txt. The good guys use web spiders that honor this protocol and will stay out of the tar pit. This would probably remove the need to generate redirects for known good people.
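A rough sketch of what the script at /tarpit/ might look like, as a generic handler that answers every tar pit URL with a 200 and a page of random words plus links back into the pit (all names and sizes here are illustrative):

```csharp
using System;
using System.Text;
using System.Web;

// Illustrative tar pit: an IHttpHandler mapped to /tarpit/* in web.config. Every
// request gets a 200 with random filler text and links back into the pit, so a
// naive crawler just keeps crawling junk.
public class TarPitHandler : IHttpHandler
{
    private static readonly string[] Words = { "lorem", "ipsum", "dolor", "sit", "amet" };

    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        var rng = new Random();
        var page = new StringBuilder("<html><body><p>");

        // Filler so the page looks like real content.
        for (int i = 0; i < 200; i++)
            page.Append(Words[rng.Next(Words.Length)]).Append(' ');
        page.Append("</p>");

        // Links to more (nonexistent) tar pit pages; this handler answers them all.
        for (int i = 0; i < 10; i++)
            page.AppendFormat("<a href=\"/tarpit/{0}.html\">more</a> ", rng.Next(int.MaxValue));

        page.Append("</body></html>");

        context.Response.StatusCode = 200;
        context.Response.ContentType = "text/html";
        context.Response.Write(page.ToString());
    }
}
```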

Tim Howland
+1, but to keep the good guys out of the pit you should use robots.txt instead of easily forgeable user-agent string.
Constantin
good point, I'll add that in.
Tim Howland
+2  A: 

If you want to protect yourself from generic crawler, use a honeypot.

See, for example, http://www.sqlite.org/cvstrac/honeypot. A good spider will not open this page because the site's robots.txt disallows it explicitly. A human may open it, but is not supposed to click the "i am a spider" link. A bad spider will certainly follow both links and so will betray its true identity.
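The robots.txt side of such a honeypot is just an explicit disallow for the trap page, something like this (the path is illustrative):

```
User-agent: *
Disallow: /honeypot
```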

If the crawler is created specifically for your site, you can (in theory) create a moving honeypot.

Constantin
A: 

I have developed a .NET component that can be dropped in as a .dll.

This then intercepts inbound requests and will verify, with 100% certainty, the top 4 search engines: Google, MSN/Bing, Yahoo, and Ask Jeeves.

It does this using a round trip DNS lookup.

That's phase 1.
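That round trip presumably amounts to forward-confirmed reverse DNS: reverse-resolve the caller's IP to a host name, check the domain, then resolve that host name forward and confirm it maps back to the same IP. A minimal sketch of such a check (the suffix list and names are illustrative assumptions, not necessarily what this component does):

```csharp
using System;
using System.Linq;
using System.Net;

// Sketch of forward-confirmed reverse DNS verification of a claimed search engine bot.
public static class BotVerifier
{
    // Host name suffixes commonly used by the major crawlers
    // (e.g. crawl-66-249-66-1.googlebot.com); treat this list as an assumption.
    private static readonly string[] TrustedSuffixes =
    {
        ".googlebot.com", ".google.com",   // Google
        ".search.msn.com",                 // MSN / Bing
        ".crawl.yahoo.net",                // Yahoo
        ".ask.com"                         // Ask
    };

    public static bool IsVerifiedSearchBot(string ipAddress)
    {
        try
        {
            // Reverse lookup: IP -> host name.
            string host = Dns.GetHostEntry(ipAddress).HostName;
            if (!TrustedSuffixes.Any(s => host.EndsWith(s, StringComparison.OrdinalIgnoreCase)))
                return false;

            // Forward lookup: host name -> IPs, which must include the original address.
            var forward = Dns.GetHostEntry(host).AddressList;
            return forward.Any(a => a.ToString() == ipAddress);
        }
        catch (Exception)
        {
            return false; // DNS failure: treat as unverified
        }
    }
}
```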

Phase 2 is a little more involved, but will normally stop any crawling bot within its first 3 page requests. That bot will then be blocked for 48 hours. The block is based on both user agent and IP. This step identifies any bot that is not obeying robots.txt: you list Disallow: PageX.htm in robots.txt and then put a link to that file on each page.

The file doesn't actually need to exist because you tell the component the filename and it will intercept those requests.

I deployed this on my company's server at the beginning of February (it's now the beginning of March) and we have seen a 72% drop in server resource usage compared to the previous 3 months.

Hades
A: 

Will you share your dll?

Brian Perry