views: 303

answers: 7

How can I prevent my ASP.NET 3.5 website from being screen scraped by my competitor? Ideally, I want to ensure that no web bots or screen scrapers can extract data from my website.

Is there a way to detect that a web bot or screen scraper is running?

+1  A: 

I don't think it is possible without authenticating users to your site.

Raj Kaimal
@Raj, so authentication will prevent that? (Of course, the competition can register and run a screen scraper.)
user279521
Authentication will not even hinder it; if they want to scrape, they will script out that process easily.
James Campbell
Wasn't sure if you had a list of authorized users that could access your app. Obviously this is not the case here.
Raj Kaimal
+4  A: 

Unplug the network cable to the server.

Paraphrase: if the public can see it, it can be scraped.

update: upon second look it appears that I am not answering the question. Sorry. Vecdid has offered a good answer.

But any half-decent coder could defeat the measures listed. In that context, my answer could be considered valid.

Sky Sanders
+1 best answer yet, might not be what the op wanted to hear, but it's the only solution.
mxmissile
+1  A: 

Ultimately you can't stop this.

You can make it harder for people to do, by setting up the robots.txt file etc. But you've got to get information onto legitimate users' screens, so it has to be served somehow, and if it is, then your competitors can get to it.
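As a sketch of the robots.txt approach: a file at the site root asking crawlers to stay out of the sensitive areas might look like the following (the paths here are illustrative, not from the original question). Note that robots.txt is purely advisory; a deliberate scraper is free to ignore it.

```
User-agent: *
Disallow: /reports/
Disallow: /pricing/
```

Well-behaved search engine crawlers will honor this, which is why relying on it to stop a competitor is wishful thinking.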

If you force users to log in, you can stop the robots, but there's nothing to stop a competitor from registering for your site anyway. This may also drive potential customers away if they can't access some information for "free".

ChrisF
@ChrisF, is there a way to detect that a web bot or screen scraper is running?
user279521
@user - check out the other answers from people with more experience in this area than me
ChrisF
A: 

I don't think that's possible. But whatever you'll come up with, it'll be as bad for search engine optimization as it will be for the competition. Is that really desirable?

JulianR
+5  A: 

It is possible to try to detect screen scrapers:

Use cookies and timing; this will make it harder for out-of-the-box screen scrapers. Also check for JavaScript support, since most scrapers do not have it. Check the browser metadata (such as the User-Agent header) to verify it is really a web browser.
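A rough sketch of those heuristics, assuming a server-side hook that sees the request's User-Agent header and cookies (the thresholds and the list of scraper signatures below are illustrative, not exhaustive):

```python
# Hypothetical request-screening heuristic: flag requests whose
# User-Agent matches a known scraping tool, or which carry no
# cookies at all (real browsers normally send cookies after the
# first page load). Easily fooled, but it stops naive scrapers.
SCRAPER_AGENT_HINTS = ("curl", "wget", "python-requests", "libwww")

def looks_like_bot(user_agent, cookies):
    if not user_agent:
        return True  # browsers always send a User-Agent
    ua = user_agent.lower()
    if any(hint in ua for hint in SCRAPER_AGENT_HINTS):
        return True
    if not cookies:
        return True  # no cookie jar at all is unusual mid-session
    return False
```

A scraper that spoofs a browser User-Agent and replays cookies will pass this check, which is why the answer pairs it with timing and JavaScript tests.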

You can also check the number of requests per minute. A user driving a browser can make only a small number of requests per minute, so server-side logic that detects too many requests per minute can presume that screen scraping is taking place and block the offending IP address for some period of time. If this starts to affect legitimate crawlers, log the blocked IPs and allow them as needed.

You can also use http://www.copyscape.com/ to protect your content; this will at least tell you who is reusing your data.

See this question also:

http://stackoverflow.com/questions/396817/protection-from-screen-scraping

Also take a look at

http://blockscraping.com/

Nice doc about screen scraping:

http://www.realtor.org/wps/wcm/connect/5f81390048be35a9b1bbff0c8bc1f2ed/scraping_sum_jun_04.pdf?MOD=AJPERES&CACHEID=5f81390048be35a9b1bbff0c8bc1f2ed

How to prevent screen scraping:

http://mvark.blogspot.com/2007/02/how-to-prevent-screen-scraping.html

James Campbell
+1 good answer. but... I have beaten most of those guards, thus my answer. ;-)
Sky Sanders
His question is whether it is possible to detect. It is, and it is easy to make it a pain to write a program to scrape the site. It is not 100%, but it will make it harder. If a user can bring it up in the browser, it can be scripted, unless you use a CAPTCHA to gate the info you don't want scraped.
James Campbell
Yes, you are right. I am guilty of answering a different question.
Sky Sanders
A: 

If your competitor is in the same country as you, have an acceptable use policy and terms of service clearly posted on your site. Mention the fact that you do not allow any sort of robots/screen scraping. If it continues, get an attorney to send them a friendly cease-and-desist letter.

jini
A: 

How about serving up every bit of text as an image? Once that is done, either your competitors will be forced to invest in OCR technology, or you will find that you have no users, so the question will be moot.
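For what it's worth, rendering text to an image server-side is straightforward; here is a minimal sketch using the Pillow imaging library (the dimensions and layout are arbitrary, and a real site would cache these renders). The original answer's point stands: this hurts accessibility, search indexing, and page weight far more than it hurts a determined scraper.

```python
import io
from PIL import Image, ImageDraw

def render_text_as_png(text):
    """Render a short string onto a white canvas and return PNG bytes.
    Width is a crude estimate from character count; a real
    implementation would measure the text with the chosen font."""
    img = Image.new("RGB", (8 * len(text) + 10, 20), "white")
    draw = ImageDraw.Draw(img)
    draw.text((5, 4), text, fill="black")  # uses Pillow's default font
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
```

The handler would then serve these bytes with an `image/png` content type instead of the original text.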

Peter M