views: 318

answers: 3

I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.

Ideally, this system would detect a crawling pattern from an unknown source (i.e., one not on a whitelist with the Google crawler, etc.) and send bogus information to the scraping crawler.

  • If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
  • If, as a defender, I detect the same agent/IP, the attacker will randomize the agent.

And this is where I get lost - if an attacker randomizes the intervals and the agent, how can I avoid discriminating against legitimate proxies and machines hitting the site from the same network?

I am thinking of checking whether a suspect agent supports JavaScript and cookies. If the bogey can't do either consistently, then it's a bad guy.
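Here is a minimal sketch of the kind of JS/cookie check I have in mind (Python/Flask assumed; the /challenge and /verify routes and the token scheme are purely illustrative):

    # Minimal sketch of a JS + cookie challenge. Flask is assumed; the
    # route names and the token scheme are made up for illustration.
    import hashlib
    import hmac
    import os

    from flask import Flask, abort, make_response, request

    app = Flask(__name__)
    SECRET = os.urandom(32)  # per-deployment secret for signing challenges


    def sign(value: str) -> str:
        # HMAC the value so the client cannot forge a passed challenge.
        return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()


    @app.route("/challenge")
    def challenge():
        # Serve a tiny page: set a cookie and emit JS that must echo a
        # signed token back via /verify. A bot that cannot store cookies
        # or execute JS never completes the round trip.
        nonce = os.urandom(8).hex()
        page = f"<script>fetch('/verify?token={sign(nonce)}', {{credentials: 'include'}});</script>"
        resp = make_response(page)
        resp.set_cookie("challenge_nonce", nonce)
        return resp


    @app.route("/verify")
    def verify():
        nonce = request.cookies.get("challenge_nonce", "")
        token = request.args.get("token", "")
        if not hmac.compare_digest(token, sign(nonce)):
            abort(403)  # no cookie or no JS executed: flag as suspect
        return "ok"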

What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?

+2  A: 

Don't try to recognize them by IP, timing, or intervals--use the data you send to the crawler to trace them.

Create a whitelist of known good crawlers--you'll serve them your content normally. For the rest, serve pages with an extra bit of unique content that only you will know how to look for. Use that signature to later identify who has been copying your content and block them.
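A minimal sketch of that watermarking idea, assuming a Python backend (the helper names, the User-Agent whitelist check, and the HTML-comment marker format are made up for illustration):

    # Embed an invisible, per-requester watermark in served pages so a
    # copied page can later be traced back to whoever scraped it.
    import hashlib
    import hmac

    SECRET = b"replace-with-a-long-random-secret"
    KNOWN_GOOD_CRAWLERS = {"Googlebot", "Bingbot"}  # whitelist


    def watermark_for(client_ip: str, user_agent: str) -> str:
        # Derive a stable, unguessable token for this requester.
        msg = f"{client_ip}|{user_agent}".encode()
        return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]


    def render_page(body_html: str, client_ip: str, user_agent: str) -> str:
        if any(bot in user_agent for bot in KNOWN_GOOD_CRAWLERS):
            return body_html  # whitelisted crawlers get clean content
        # Everyone else gets a hidden marker; finding that marker on a
        # clone site identifies the IP/agent pair that scraped it.
        token = watermark_for(client_ip, user_agent)
        return body_html + f"\n<!-- ref:{token} -->"

Note that whitelisting by User-Agent string alone is spoofable; in practice you would also verify the good crawlers (e.g. by reverse DNS) before serving them clean content.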

sj2009
+6  A: 

My solution would be to make a trap. Put some pages on your site where access is banned by robots.txt. Make a link to them on your page, but hide it with CSS, then IP-ban anybody who goes to that page.

This will force the offender to obey robots.txt, which means that you can keep important information or services permanently out of his reach, making his carbon-copy clone useless.
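A rough sketch of the trap, assuming a Python/Flask app (the /do-not-crawl path and the in-memory ban set are illustrative; a real deployment would ban at the firewall or use a persistent store):

    # Honeypot: robots.txt disallows a path, the page links to it
    # invisibly, and any IP that fetches it is banned from then on.
    from flask import Flask, abort, request

    app = Flask(__name__)
    BANNED_IPS = set()  # swap for a persistent store in real use


    @app.before_request
    def block_banned():
        if request.remote_addr in BANNED_IPS:
            abort(403)


    @app.route("/robots.txt")
    def robots():
        body = "User-agent: *\nDisallow: /do-not-crawl\n"
        return body, 200, {"Content-Type": "text/plain"}


    @app.route("/do-not-crawl")
    def honeypot():
        # Only a crawler that ignores robots.txt and follows hidden
        # links ends up here.
        BANNED_IPS.add(request.remote_addr)
        abort(403)


    @app.route("/")
    def index():
        # The trap link is hidden from human visitors with CSS.
        return '<a href="/do-not-crawl" style="display:none">x</a> ...normal page...'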

tomjen
What's to stop the attacker from changing his crawler to ignore hidden links? Either by automatically scanning and comparing your .css against the link or link-area classes, or just taking a daily peek at your source and quickly adding the link to his crawler's ignore list. What if he writes an algorithm to verify the quality of a link's content so as to avoid trap links?
ian
Isn't hiding links with CSS considered cheating by Googlebot (no matter for what purpose, they will not care)?
Marek
A nice trick is to *only* mention the honeypot as disallowed in robots.txt (and don't link to it anywhere) - some evil robots read the file and then crawl the disallowed links in hopes of finding some juicy data, and BLAM! Banned!
Piskvor
+2  A: 

And how do you keep someone from hiring a person in a low-wage country to use a browser to access your site and record all of the information? Set up a robots.txt file, invest in security infrastructure to prevent DoS attacks, obfuscate your code (where it's accessible, like JavaScript), patent your inventions, and copyright your site. Let the legal people worry about someone ripping you off.

tvanfosson
For the purposes of this "exercise" we can assume that a) there is too much data to copy manually, b) the data changes frequently, and c) the attacker is a no-good punk who will never spend money to have someone do this.
Andrei Taranchenko
Track the punk down with GeoIP and have your Uncle Sal "make him an offer he can't refuse." :-)
tvanfosson