I'm developing a website and am sensitive to people screen scraping my data. I'm not worried about scraping one or two pages -- I'm more concerned about someone scraping thousands of pages as the aggregate of that data is much more valuable than a small percentage would be.

I can imagine strategies to block users based on heavy traffic from a single IP address, but the Tor network sets up many circuits that essentially mean a single user's traffic appears to come from different IP addresses over time.
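For what it's worth, here is a minimal sketch of that single-IP strategy as an ASP.NET MVC action filter. The `ThrottleByIp` name, the 100-requests-per-minute threshold, and the 503 response are all illustrative assumptions, not a recommendation:

```csharp
using System;
using System.Collections.Generic;
using System.Web.Mvc;

// Sketch: per-IP request throttling as an MVC action filter.
// Threshold and window are arbitrary; tune them for real traffic.
public class ThrottleByIpAttribute : ActionFilterAttribute
{
    private static readonly Dictionary<string, Queue<DateTime>> Hits =
        new Dictionary<string, Queue<DateTime>>();
    private static readonly object Sync = new object();

    private const int MaxRequests = 100;                        // per window
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);

    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        string ip = filterContext.HttpContext.Request.UserHostAddress;
        lock (Sync)
        {
            Queue<DateTime> timestamps;
            if (!Hits.TryGetValue(ip, out timestamps))
                Hits[ip] = timestamps = new Queue<DateTime>();

            DateTime now = DateTime.UtcNow;
            while (timestamps.Count > 0 && now - timestamps.Peek() > Window)
                timestamps.Dequeue();                           // drop expired hits

            timestamps.Enqueue(now);
            if (timestamps.Count > MaxRequests)
            {
                // Over the cap: refuse the request instead of running the action.
                filterContext.HttpContext.Response.StatusCode = 503;
                filterContext.Result = new ContentResult { Content = "Too many requests." };
            }
        }
    }
}
```

Note that in-memory state like this resets whenever the app pool recycles and doesn't span a web farm; a real implementation would need shared storage. And, as the question says, it's exactly this kind of per-IP counting that Tor's rotating exit IPs defeat.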

I know it is possible to detect Tor traffic: when I installed Vidalia with its Firefox extension, google.com presented me with a captcha.

So, how can I detect such requests?

(My website is in ASP.NET MVC 2, but I think any approach used here would be language-independent.)

+5  A: 

> I'm developing a website and am sensitive to people screen scraping my data

Forget about it. If it's on the web and someone wants it, it will be impossible to stop them from getting it. The more restrictions you put in place, the more you'll risk ruining user experience for legitimate users, who will hopefully be the majority of your audience. It also makes code harder to maintain.

I'll post countermeasures to any ideas future answers propose.

Aillyn
I'm in agreement with Aillyn; it will be near-impossible to stop somebody from screen-scraping your site. Pursuing options to prevent it will merely consume time better spent improving other aspects of your site. Focus on things that make your site unique and better than the screen-scrapers. Look at Stack Overflow for instance: it is being scraped by tons of bottom-feeders, but that doesn't prevent it from being useful or awesome.
Cal Jacobson
@Cal They don't even have to scrape it; the content is made available through the [data dumps](http://blog.stackoverflow.com/category/cc-wiki-dump/).
Aillyn
@Cal, SO data is available as a download under Creative Commons http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
Drew Noakes
@Aillyn ...snap
Drew Noakes
@Aillyn, I'm with you that it's impossible to _stop_ people taking the data. Mostly I'm just interested in making it very hard for them to do so. I can envisage very simple blocking behaviour that wouldn't impede any human visitor, but that falls apart if someone's using the Tor network or another distributed proxy. Thanks for your answer, if not your encouragement :)
Drew Noakes
@Aillyn, Drew -- Ah, I didn't know that. I think that only reinforces my point, however: despite the content being readily available, SO offers something that the imitators can't seem to provide. Maybe it's simply reputation. Maybe it's timeliness. Maybe it's design. Maybe it's the fact that a typical user can spot the crapfest knock-offs out there, see them for what they are, and choose to support the real deal. Whatever it is, Drew, don't let screen-scraping alone dissuade you from going forward with your site.
Cal Jacobson
@Drew I used this service that wouldn't let people copy their content. On top of tons of legal mumbo-jumbo (i.e., you will be prosecuted to the fullest extent of the law if you copy our content), the program ran in Java and cleared the clipboard, and also checked for running processes that were image capture programs. Very annoying indeed. I just installed a packet sniffer and saved *all* their SOAP responses. Not only did I have the data, I had it in a very usable, programming-friendly format. Then I released the content anonymously. So yeah, don't do it.
Aillyn
A: 

By design of the Tor network, it is not possible for the receiver to find out whether the requester is the original source or just a relay passing the request along.

The behaviour you saw with Google was probably caused by a different security measure. Google detects when a logged-in user's IP changes and presents a captcha as a precaution against session hijacking, while still allowing the session to continue if the authenticated user really did change IP (by reconnecting to their ISP, etc.).

Kosi2801
@Kosi2801, that's interesting, but I don't use Firefox regularly, so any cookie I had would have been weeks old. Also, what about ISPs that change people's IP addresses via DHCP? I'm not saying you're wrong, I just wondered whether they tracked Tor node IP addresses. Vidalia shows a list of all relays and their IP addresses in the UI. Perhaps Google monitors that list...
Drew Noakes
Google places cookies with an expiration date of 2 years ( http://googleblog.blogspot.com/2007/07/cookies-expiring-sooner-to-improve.html ), so a cookie a few weeks old is not an issue. I do not know how many different mechanisms Google uses to identify sessions, but there are plenty of them. Just as a note, I regularly experience captchas on Google services (once or twice a week) to continue my session, and I'm not using any anonymizing technologies. These have been getting rarer, though; I guess Google learns the IP ranges I work from (maybe similar to Latitude location learning).
Kosi2801
+2  A: 

You can check the requester's IP address against a list of Tor exit nodes, though I know for a fact this won't even slow down someone who is interested in scraping your site. Tor is too slow; most scrapers won't even consider it. There are tens of thousands of open proxy servers that can easily be scanned for, or a list can be purchased. Proxy servers are nice because you can thread them, or rotate them if your request cap gets hit.
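If you do want that check anyway, something like the following sketch could consult the bulk exit list the Tor Project publishes. The `torbulkexitlist` URL and the one-hour cache interval are assumptions here; verify the current endpoint, and note this naive version is not thread-safe:

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Sketch: is a given IP a known Tor exit node?
// Fetches the Tor Project's bulk exit list (one IP per line) and caches it,
// rather than hitting the network on every request. Not thread-safe as-is.
public static class TorExitList
{
    private const string ListUrl = "https://check.torproject.org/torbulkexitlist";
    private static HashSet<string> _exitIps;
    private static DateTime _fetchedAt;

    public static bool IsTorExit(string ip)
    {
        if (_exitIps == null || DateTime.UtcNow - _fetchedAt > TimeSpan.FromHours(1))
        {
            using (var client = new WebClient())
            {
                string body = client.DownloadString(ListUrl);
                _exitIps = new HashSet<string>(
                    body.Split(new[] { '\n', '\r' },
                               StringSplitOptions.RemoveEmptyEntries));
            }
            _fetchedAt = DateTime.UtcNow;
        }
        return _exitIps.Contains(ip);
    }
}
```

From a controller you might call `if (TorExitList.IsTorExit(Request.UserHostAddress)) { /* serve a captcha instead */ }` -- but keep in mind this only catches Tor, not the open proxies mentioned above.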

Google has been abused by Tor users, and most exit nodes are on Google's blacklist; that's why you are getting a captcha.

Let me be perfectly clear: THERE IS NOTHING YOU CAN DO TO PREVENT SOMEONE FROM SCRAPING YOUR SITE.

Rook
Tor is slow in terms of latency, but you can just as easily fan the load out across concurrent requests to get the same net throughput.
Drew Noakes
@Drew Noakes I disagree; proxy servers are definitely the way to go: much faster, and more control over what your IP address is. Also, on a side note, IP addresses are cheap, like pennies a pop; you can just buy a massive block and then rip down some site. You need to come up with a business model that works with the internet. It boggles my mind when people try to limit access in the information age. I have a feeling your next SO question is how to implement DRM that works.
Rook
I understand your point and tend to agree. I'm not talking about trying to stop everyone, just those who aren't massively motivated or competent. Much like modern DRM deters the majority of people from ever learning how to strip it from music they buy, for example.
Drew Noakes
@Drew Noakes I think you missed my point. DRM doesn't do anything at all, just like this bogus security system. It cannot stop anything (thepiratebay.com); both the idea of trying to stop scraping and the idea of DRM are conceived by people who do not understand.
Rook