I am wondering whether there are any techniques to identify a web crawler that is collecting information for illegal use; plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (i.e., one not on the whitelist of known crawlers such as Googlebot) and send bogus information to the scraping crawler.
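For the whitelist part, the usual trick I know of is reverse-DNS verification of clients that claim to be a known crawler: resolve the IP's PTR record, check the hostname suffix, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal sketch, where the suffix list is a hand-maintained assumption and the "serve bogus content" routing is left out:

```python
import socket

# Assumed whitelist of reverse-DNS suffixes for crawlers we trust;
# extend with bingbot etc. as needed.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_crawler(ip: str) -> bool:
    """Verify a crawler claim via reverse DNS plus forward confirmation.

    1. PTR lookup on the requesting IP.
    2. Check that the hostname ends with a trusted suffix.
    3. Forward-resolve that hostname and confirm it maps back to the IP,
       so a scraper can't simply fake its PTR record.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```

Requests whose user agent claims to be a whitelisted bot but which fail this check would be the first candidates for the bogus-content treatment. The harder part is everything that doesn't claim to be a known bot at all: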
- If, as a defender, I detect an unknown crawler hitting the site at regular intervals, the attacker will just randomize the intervals (a minimal timing-regularity sketch follows this list).
- If, as a defender, I detect the same user agent/IP, the attacker will randomize the agent.
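To make the first bullet concrete, here is a rough sketch of the per-source timing analysis I have in mind: keep a sliding window of request timestamps per client key and flag clients whose inter-request gaps are suspiciously uniform. The window size, the thresholds, and the choice of key (IP, IP+agent, session) are all assumptions:

```python
import time
import statistics
from collections import defaultdict, deque

# Hypothetical thresholds; real values would have to come from a traffic baseline.
WINDOW = 50          # number of recent requests kept per client
MIN_SAMPLES = 10     # don't judge until we have enough intervals
MAX_CV = 0.1         # coefficient of variation below this looks machine-timed

_history = defaultdict(lambda: deque(maxlen=WINDOW))

def record_and_score(client_key: str) -> bool:
    """Record a hit for client_key (e.g. IP, or IP + user agent) and return
    True if its inter-request timing looks suspiciously regular."""
    hits = _history[client_key]
    hits.append(time.monotonic())
    if len(hits) < MIN_SAMPLES:
        return False
    ts = list(hits)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    mean = statistics.mean(intervals)
    if mean == 0:
        return True
    cv = statistics.stdev(intervals) / mean  # relative spread of the gaps
    return cv < MAX_CV
```

Of course, this is exactly what the randomized intervals in the first bullet defeat, so timing alone doesn't seem to be enough.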
And this is where I get lost: if an attacker randomizes both the intervals and the agent, how do I flag that traffic without also penalizing legitimate users behind proxies or NAT, where many machines hit the site from the same network address?
I am thinking of challenging a suspect client for JavaScript and cookie support. If the bogey can't handle either consistently, then it's a bad guy.
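Something along these lines is what I picture for that challenge; a minimal Flask sketch, where the cookie name, the in-memory miss counter, and the threshold are all placeholders (and it ignores the shared-NAT problem above):

```python
from flask import Flask, request

app = Flask(__name__)

# Inline script that sets a cookie client-side; a client that never executes
# JavaScript (or refuses cookies) keeps requesting pages without it.
CHALLENGE_SNIPPET = '<script>document.cookie = "js_ok=1; path=/";</script>'

# Hypothetical per-IP count of pages served without the cookie; in production
# this would live in Redis or similar, not process memory.
misses = {}
MISS_LIMIT = 5  # assumed threshold before a client is treated as a bot

@app.before_request
def check_js_cookie():
    ip = request.remote_addr
    if request.cookies.get("js_ok"):
        misses.pop(ip, None)            # client executed the challenge
        return None
    misses[ip] = misses.get(ip, 0) + 1
    if misses[ip] > MISS_LIMIT:
        # Instead of refusing, this is where bogus content could be served.
        return "Please enable JavaScript and cookies.", 403
    return None

@app.route("/")
def index():
    return "<html><body>Real content here." + CHALLENGE_SNIPPET + "</body></html>"

if __name__ == "__main__":
    app.run()
```

Routing flagged clients to bogus content instead of a 403 would match the original goal above, but this still feels like it only catches the dumbest scrapers.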
What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?