views:

602

answers:

5

I'm building an e-commerce website with a large database of products. Of course, it's nice when Google indexes all of the site's products. But what if a competitor wants to web-scrape the site and grab all of the images and product descriptions?

I was looking at some websites with similar product lists, and they put a CAPTCHA in front of them, so "only humans" can read the list of products. The drawback is that the list becomes invisible to Google, Yahoo, and other "well-behaved" bots.

+2  A: 

You can discover the IP addresses that Google and the others are using by checking visitor IPs with whois (on the command line or on a web site). Then, once you've accumulated a stash of legitimate search engines, allow them into your product list without the CAPTCHA.
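A whois check works, but it is easy to automate the same idea with reverse DNS plus a forward-confirming lookup, which is much harder to forge than a spoofed User-Agent. A minimal sketch in Python (the trusted-suffix list and function names here are my own assumptions, not an official list):

```python
import socket

# Hostname suffixes of crawlers we choose to trust (assumed list; extend as needed).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com", ".crawl.yahoo.net")

def hostname_is_trusted(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname belongs to a known crawler domain."""
    return hostname.endswith(TRUSTED_SUFFIXES)

def is_legit_crawler(ip: str) -> bool:
    """Verify a claimed crawler IP: reverse DNS lookup, suffix check, then a
    forward lookup to confirm the hostname really resolves back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS
    except socket.herror:
        return False
    if not hostname_is_trusted(hostname):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record could simply be forged.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

The forward-confirmation step matters: anyone can set a reverse record on their own IP range claiming to be `googlebot.com`, but they cannot make Google's DNS resolve that hostname back to their IP.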

Nerdling
Can't the screen scrapers fake their IPs extraordinarily easily?
Allen
Not if they wanted the HTTP response to get routed correctly.
Josh Einstein
+1  A: 

Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the header and use a request schedule that resembles a human browser, it is not possible to completely stop professional scrapers. But you can check for these things nevertheless and prevent casual scraping. I personally find CAPTCHAs annoying for anything other than signing up on a site.
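To illustrate the kind of cheap, spoofable checks meant here, a server-side heuristic might look like the following sketch (the blocklist and function name are illustrative assumptions; the header names are standard HTTP):

```python
# User-Agent substrings that naive scraping tools send by default (assumed list).
SUSPICIOUS_AGENTS = ("curl", "wget", "python-requests", "libwww")

def looks_like_casual_scraper(headers: dict, is_image_request: bool = False) -> bool:
    """Flag requests that an out-of-the-box scraper would send: a missing or
    blocklisted User-Agent, or an image fetched with no Referer. A professional
    scraper can spoof all of these, so treat this as a speed bump only."""
    agent = headers.get("User-Agent", "").lower()
    if not agent:
        return True
    if any(tool in agent for tool in SUSPICIOUS_AGENTS):
        return True
    if is_image_request and not headers.get("Referer"):
        return True
    return False
```

Checks like these stop hotlinking and lazy one-off scrapers; anyone who copies a real browser's headers sails straight through, which is exactly the answer's point.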

cdonner
+1  A: 

If you're worried about competitors using your text or images, how about a watermark or customized text?

Let them take your images and you'd have your logo on their site!

Mark
A: 

One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.

The basic process is that you build your own "blacklist" of scraper IPs by looking for IP addresses that request 2+ unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine whether they are nice (like GoogleBot or Slurp) or bad.
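The log-mining step can be sketched in a few lines of Python. The thresholds (2 distinct products within 5 seconds) and the log format `(ip, timestamp, product_id)` are illustrative assumptions to be tuned against real traffic:

```python
from collections import defaultdict

def build_blacklist(log, max_products: int = 2, window_seconds: int = 5) -> set:
    """Scan (ip, timestamp, product_id) log entries and flag any IP that
    hits `max_products` or more distinct products within `window_seconds`.
    Returned IPs are candidates for a reverse lookup, not automatic bans."""
    hits = defaultdict(list)                 # ip -> [(timestamp, product_id), ...]
    for ip, ts, product in log:
        hits[ip].append((ts, product))
    blacklist = set()
    for ip, events in hits.items():
        events.sort()
        for i, (ts, _) in enumerate(events):
            # Count distinct products this IP viewed within the window starting here.
            seen = {p for t, p in events[i:] if t - ts <= window_seconds}
            if len(seen) >= max_products:
                blacklist.add(ip)
                break
    return blacklist
```

For example, an IP that views two different products one second apart gets flagged, while a human browsing two products a couple of minutes apart does not:

```python
log = [("1.2.3.4", 0, "A"), ("1.2.3.4", 1, "B"),
       ("5.6.7.8", 0, "A"), ("5.6.7.8", 120, "B")]
build_blacklist(log)  # → {"1.2.3.4"}
```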

HipHop-opatamus
A: 

Perhaps I'm over-simplifying, but if your concern is server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
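The feed behind such an API can be tiny; the point is that consumers hit one cacheable, rate-limitable endpoint instead of crawling every HTML page. A sketch, where the catalogue, field names, and pagination defaults are all invented for illustration:

```python
import json

# Hypothetical product catalogue; in practice this comes from your database.
PRODUCTS = [
    {"id": 1, "name": "Widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "price": 19.99},
]

def products_feed(page: int = 0, page_size: int = 50) -> str:
    """Return one page of the catalogue as JSON. Serving this instead of
    letting bots crawl full product pages saves bandwidth and CPU, and the
    single endpoint is easy to rate-limit or put behind an API key."""
    start = page * page_size
    return json.dumps({"page": page,
                       "products": PRODUCTS[start:start + page_size]})
```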

Other thoughts listed here:

http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/

Jason Bellows