Following on from my question about the legalities of screen scraping: even if it's illegal, people will still try. So:

What technical mechanisms can be employed to prevent or at least disincentivise screen scraping?

Oh, and just for grins and to make life difficult, it would be nice to retain access for search engines. I may be playing devil's advocate here, but there is a serious underlying point.

+39  A: 

You can’t prevent it.

Bombe
You _can_ make it difficult
too much php
yes, yes you can - just not very well
annakata
I love those answers... "You can't". Everything can be done, in one way or another.
Stefan
Ok, you can do it. Just don't output anything. Show your user a blank page. Mission accomplished: screen scraping prevented!
Bob Somers
"You can't"... "Everything can be done". Two absolutes that are never true.
Ed Swangren
"never true", another absolute... :)
Bill the Lizard
13 votes for something demonstrably incorrect though...
annakata
The Head of Computing at my college often said, "If you think a problem can't be solved, then maybe you just don't know enough about the problem."
Lee Kowalkowski
You have to give the user the data (so they can use your page). You have to not give the user the data (or they can scrape it). If you have further problems, consult a Zen master, 'cause this software guy is out of ideas.
David Thornley
+5  A: 

It would be very difficult to prevent. The problem is that Web pages are meant to be parsed by a program (your browser), so they are exceptionally easy to scrape. The best you can do is be vigilant, and if you find that your site is being scraped, block the IP of the offending program.

Bill the Lizard
+1  A: 

Very few, I think, given that the intention of any site is to publish (i.e. to make public) information.

  • You can hide your data behind logins of course, but that's a very situational solution.

  • I've seen apps which would only serve up content where the request headers indicated a web browser (rather than, say, anonymous or "jakarta"), but that's easy to spoof and you'll lose some genuine humans.

  • Then there's the possibility that you accept some scraping but make life hard for the scrapers by not serving content when requests come from the same IP at too high a rate (a rough sketch follows below). This doesn't give full coverage, and more importantly there is the "AOL problem": a single IP can cover many, many unique human users.

Both of the last two techniques also depend heavily on traffic-intercepting technology, which is an inevitable performance and/or financial outlay.
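
A rough sketch of that per-IP throttling idea, assuming a Flask application; the window size, the request limit, and the in-memory store are illustrative choices, not recommendations (a real deployment behind a load balancer would need a shared store):

    import time
    from collections import defaultdict, deque

    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60            # look at the last minute of traffic
    MAX_REQUESTS_PER_WINDOW = 30   # beyond this, treat the IP as a scraper
    _recent_hits = defaultdict(deque)  # ip -> timestamps of recent requests

    @app.before_request
    def throttle_by_ip():
        ip = request.remote_addr
        now = time.time()
        window = _recent_hits[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()       # forget requests older than the window
        window.append(now)
        if len(window) > MAX_REQUESTS_PER_WINDOW:
            abort(429)             # Too Many Requests -- note the AOL caveat above

    @app.route("/")
    def index():
        return "content"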

annakata
A: 

You could check the user agent of clients coming to your site. Some third-party screen-scraping programs have their own user agent, so you could block that. Good screen scrapers, however, spoof their user agent, so you won't be able to detect them. Be careful if you do try to block anyone, because you don't want to block a legitimate user :)

The best you can hope for is to block people using screen scrapers that aren't smart enough to change their user agent.
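
A minimal sketch of user-agent filtering, assuming a Flask application; the blocklist fragments are illustrative examples only, not a definitive list:

    from flask import Flask, abort, request

    app = Flask(__name__)

    BLOCKED_AGENT_FRAGMENTS = ("jakarta", "python-requests", "curl", "wget")

    @app.before_request
    def block_known_scraper_agents():
        agent = (request.headers.get("User-Agent") or "").lower()
        # An empty User-Agent, or one matching a known scraper, gets rejected.
        # Anything spoofing a browser string sails straight through -- this
        # only stops scrapers that don't bother to change their agent.
        if not agent or any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS):
            abort(403)

    @app.route("/")
    def index():
        return "content"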

Alex
A: 

I tried to "screen scrape" some PDF files once, only to find that they'd actually put the characters in the PDF in semi-random order. I guess the PDF format allows you to specify a location for each block of text, and they'd used very small blocks (smaller than a word). I suspect that the PDFs in question weren't trying to prevent screen scraping so much as they were doing something weird with their render engine.

I wonder if you could do something like that.

Paul Tomblin
A: 

You could put everything in flash, but in most cases that would annoy many legitimate users, myself included. It can work for some information such as stock prices or graphs.

too much php
Google already indexes Flash files, so this would theoretically not block indexing or scraping of content: http://googlewebmastercentral.blogspot.com/2008/06/improved-flash-indexing.html
Dave R.
+1  A: 

Given that most sites want a good search engine ranking, and search engines are scraper bots, there's not much you can do that won't harm your SEO.

You could make an entirely AJAX-loaded site or a Flash-based site, which would make it harder for bots, or hide everything behind a login, which would make it harder still; but either of these approaches is going to hurt your search rankings and possibly annoy your users, and if someone really wants it, they'll find a way.

The only guaranteed way of having content that can't be scraped is to not publish it on the web. The nature of the web is such that when you put it out there, it's out there.

seanb
+1  A: 

Prevent? -- impossible, but you can make it harder.

Disincentivise? -- possible, but you won't like the answer: provide bulk data exports for interested parties.

In the long run, all your competitors will have the same data if you publish it, so you need other means of differentiating your website (e.g. update it more frequently, make it faster or easier to use). Nowadays even Google is using scraped information like user reviews; what do you think you can do about it? Sue them and get booted from their index?

mjy
-1 for using 'Disincentivise' - eschew obfuscation!
egrunin
it's in the question, not my fault ...
mjy
+4  A: 

Search engines ARE screen scrapers by definition. So most things you do to make it harder to screen scrape will also make it harder to index your content.

Well-behaved robots will honour your robots.txt file. You could also block the IPs of known offenders, or add obfuscating HTML tags to your content when it's not being sent to a known good robot. It's a losing battle, though. I recommend the litigation route for known offenders.

You could also hide identifying data in the content to make it easier to track down offenders. Encyclopaedias have been known to add fictitious entries to help detect and prosecute copyright infringers.
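
For the "known good robot" part, one workable check is the reverse-then-forward DNS lookup that Google documents for verifying Googlebot. A rough sketch; the function name and the hard-coded example IP are my own illustration:

    import socket

    def is_verified_googlebot(ip_address):
        """Reverse-resolve the IP, check the domain, then confirm the
        hostname resolves back to the same IP."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip_address)      # reverse lookup
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            forward_ips = socket.gethostbyname_ex(hostname)[2]     # forward lookup
            return ip_address in forward_ips                       # must round-trip
        except socket.error:
            return False

    if is_verified_googlebot("66.249.66.1"):   # illustrative IP
        print("serve the clean, unobfuscated content")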

Chris Nava
+8  A: 

It's pretty hard to prevent screen scraping, but if you really, really wanted to, you could change your HTML frequently or change the HTML tag names frequently. Most screen scrapers work by using string comparisons with tag names, or regular expressions searching for particular strings, etc. If you keep changing the underlying HTML, they will need to keep changing their software.
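
A rough sketch of one way to "keep changing the HTML": decorate the real class names with a suffix that rotates (here, daily), so scrapers keyed on fixed names or regular expressions break on a schedule. Purely illustrative; the CSS would have to be generated with the same suffix:

    import datetime
    import hashlib

    def rotating_class(base_name):
        """Return the class name with a suffix that changes each day."""
        today = datetime.date.today().isoformat()
        suffix = hashlib.sha1(f"{base_name}-{today}".encode()).hexdigest()[:8]
        return f"{base_name}-{suffix}"

    price_class = rotating_class("price")
    html = f'<span class="{price_class}">19.99</span>'
    print(html)   # e.g. <span class="price-3fa1b2c4">19.99</span>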

KiwiBastard
how do you propose "changing" standard HTML tags and having browsers display the HTML? This doesn't make any sense
fuzzy lollipop
No, not what I meant. By changing the HTML, I meant changing the HTML code or structure, as the scraping apps usually expect the HTML code to be in a particular form with particular names. Changing them regularly would mean the scraping app would also need to be recoded. Like I said, it's not going to prevent it, but it could annoy the scraper enough that they stop.
KiwiBastard
+1  A: 

One way is to create a function that takes text and a position, then server-side generates x, y coordinates for every character in the text and emits divs, in random order, containing those characters. Then generate JavaScript that positions every div in the right place on screen. It looks fine on screen, but in the underlying code there is no real order in which to fetch the text, unless the scraper goes to the trouble of scraping via your JavaScript (which can be changed dynamically on every request).

It's a lot of work and possibly has many quirks; it depends on how much text you have, how complicated the site's UI is, and other things.
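
A rough sketch of the idea (the answer suggests doing the positioning with generated JavaScript; inline absolute positioning is used here only to keep the sketch short, and real code would also HTML-escape the characters):

    import random

    def scramble_text(text, char_width_px=10):
        """Emit the characters as positioned spans in shuffled source order,
        so the text cannot be recovered by reading the HTML top to bottom."""
        pieces = list(enumerate(text))   # remember each char's true position
        random.shuffle(pieces)           # emit them in random order
        spans = [
            f'<span style="position:absolute;left:{i * char_width_px}px">{c}</span>'
            for i, c in pieces
        ]
        return '<div style="position:relative">' + "".join(spans) + "</div>"

    print(scramble_text("Price: 19.99"))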

Stefan
+1  A: 

If it's not much information you want to protect, you can convert it to an image on the fly. Then they must use OCR, which makes it easier to scrape another site instead of yours.
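
A minimal sketch of rendering a value as an image on the fly, using Pillow (the library choice is an assumption; the answer doesn't name one). A real endpoint should still offer a text alternative for accessibility:

    from PIL import Image, ImageDraw

    def text_as_image(text, path="value.png"):
        """Draw the text onto a small white image and save it."""
        image = Image.new("RGB", (12 * len(text) + 10, 30), "white")
        draw = ImageDraw.Draw(image)
        draw.text((5, 5), text, fill="black")   # default bitmap font
        image.save(path)

    text_as_image("19.99")   # serve value.png instead of the plain text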

Stefan
+9  A: 

So, one approach would be to obfuscate the code (rot13, or something), and then have some JavaScript in the page that does something like document.write(unobfuscate(obfuscated_page)). But this totally blows away search engines (probably!).

Of course this doesn’t actually stop someone who wants to steal your data either, but it does make it harder.
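
A rough sketch of that rot13-plus-document.write approach, with the server side in Python and the classic one-line rot13 decoder inlined as JavaScript (all names and markup here are illustrative):

    import codecs

    content = "<p>The secret price is 19.99</p>"
    obfuscated = codecs.encode(content, "rot13")

    page = f"""<html><body>
    <script>
      var data = {obfuscated!r};
      // decode: rot13 the alphabetic characters back (classic one-liner)
      var plain = data.replace(/[a-zA-Z]/g, function (c) {{
        return String.fromCharCode(
          (c <= 'Z' ? 90 : 122) >= (c = c.charCodeAt(0) + 13) ? c : c - 26);
      }});
      document.write(plain);
    </script>
    </body></html>"""

    print(page)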

Once the client has the data it is pretty much game over, so you need to look at something on the server side.

Given that search engines are basically screen scrapers, things are difficult. You need to look at what the difference is between the good screen scrapers and the bad screen scrapers. And of course, you have just the normal human users as well. So this comes down to the problem of how you can, on the server, effectively classify a request as coming from a human, a good screen scraper, or a bad screen scraper.

So, the place to start would be looking at your log files and seeing if there is some pattern that allows you to effectively classify requests, and then, having determined the pattern, seeing whether there is some way a bad screen scraper, knowing this classification, could cloak itself to appear like a human or a good screen scraper.

Some ideas:

  • You may be able to determine the good screen scrapers by IP address(es).
  • You could potentially determine scraper vs. human by number of concurrent connections, total number of connections per time-period, access pattern, etc.

Obviously these aren't ideal or fool-proof. Another tactic is to determine what measures you can take that are unobtrusive to humans but (may be) annoying for scrapers. An example might be slowing down the rate of requests. (Whether this works depends on the time-criticality of the requests: if they are scraping in real time, it would affect their end users.)

The other aspect is to look at serving these users better. Clearly they are scraping because they want the data. If you provide an easy way for them to obtain the data directly in a useful format, that will be easier for them than screen scraping. And if there is an official channel, access to the data can be regulated: e.g. give requesters a unique key, then limit the number of requests per key to avoid overloading the server, or charge per 1,000 requests, etc.
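
A minimal sketch of that keyed, metered access idea, again assuming Flask; the key, the limit, and the in-memory counter are all illustrative:

    from collections import Counter

    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    ISSUED_KEYS = {"abc123": "example-partner"}   # key -> requester
    DAILY_LIMIT = 1000
    usage = Counter()   # requests served per key (reset daily in a real system)

    @app.route("/api/data")
    def data_export():
        key = request.args.get("key")
        if key not in ISSUED_KEYS:
            abort(401)
        usage[key] += 1
        if usage[key] > DAILY_LIMIT:
            abort(429)   # or switch to billing per extra 1,000 requests
        return jsonify({"items": ["the data scrapers actually want"]})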

Of course there are still people who will want to rip you off, and then there are probably other ways to disincentivise them, but they probably start being non-technical and require legal avenues to be pursued.

benno
this "solution" doesn't prevent "screen scraping" in any way, I can just save the rendered HTML to a disk and parse it there as much as I like.
fuzzy lollipop
I think I addressed very clearly in the answer: "Of course this doesn’t actually stop someone who wants to steal your data either".
benno
A: 

Whatever you may do, it will always be

"Difficult, yet possible and Possible, though difficult!"

Mohit Nanda
-1, this is just a comment that doesn't answer the question at all
Lord Torgamus
A: 

I suspect there is no good way to do this.

I suppose you could run all your content through a mechanism to convert text to images rendered using a CAPTCHA-style font and layout, but that would break SEO and annoy your users.

Adam Jaskiewicz
+2  A: 

The best return on investment is probably to add random newlines and multiple spaces, since most screen scrapers work from the HTML as text rather than as XML (since most pages don't parse as valid XML).

The browser collapses the extra whitespace, so your users don't notice that

  Price : 1
  Price :    2
  Price\n:\n3

are different. (This comes from my experience scraping government sites with AWK.)

Next step is adding tags around random elements to mess up the DOM.
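
A rough sketch of that whitespace jitter, assuming the values are emitted through a small helper; the padding characters and amounts are arbitrary:

    import random

    def jitter(label, value):
        """Pad around the value with a random mix of spaces, tabs and newlines;
        the rendered page looks identical, but exact-string and simple-regex
        scrapers keyed on 'Price : 1' style text break."""
        pad = lambda: "".join(random.choice(" \t\n") for _ in range(random.randint(1, 4)))
        return f"{label}{pad()}:{pad()}{value}"

    print(repr(jitter("Price", "19.99")))   # e.g. 'Price \t:\n  19.99'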

Dave
Those are easy to get around. This is a weak solution. Changing the HTML would be better, e.g. from `<a href="link">` to `< a href = 'link'>`. But it is still fairly simple to accommodate in the code. Yeah, it may delay someone for a few minutes, but it can be overcome quickly.
cdburgess
A: 

Well, before you push the content from the server to the client, remove all the \r\n, \n, \t and replace everything with nothing but a single space, so you end up with one long line in your HTML page. Google does this. It will make it harder for others to read your HTML or JavaScript.
Then you can create empty tags and randomly insert them here and there. They will have no effect.
Then you can log all the IPs and how often they hit your site. If you see one that comes in at regular intervals every time, mark it as a robot and block it.
Make sure you leave the search engines alone if you want them to keep coming in.
Hope this helps.
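
A minimal sketch of the first two steps (collapse everything onto one line, then sprinkle harmless empty tags); the regexes and the choice of empty <span> are my own illustration:

    import random
    import re

    def harden(html):
        one_line = re.sub(r"\s+", " ", html)   # one long line, as suggested above
        # occasionally drop a harmless empty tag after a closing tag
        return re.sub(
            r"(</\w+>)",
            lambda m: m.group(1) + ("<span></span>" if random.random() < 0.3 else ""),
            one_line,
        )

    print(harden("<div>\n\t<p>Hello</p>\n</div>"))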

VN44CA
+3  A: 

Don't prevent it; detect it and retaliate against those who try.

For example, leave your site open to download, but disseminate some links that no sane user would follow. If someone follows such a link, clicks too fast for a human, or shows other suspicious behaviour, react promptly to stop them. If there is a login system, block the user and contact them regarding the unacceptable behaviour; that should make sure they don't try again. If there is no login system, instead of the actual pages, return a big warning with fake links to the same warning.
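
A rough sketch of the honeypot-link part, assuming Flask; the URL is illustrative, and in practice the link would also be disallowed in robots.txt so well-behaved crawlers never touch it:

    from flask import Flask, abort, request

    app = Flask(__name__)
    flagged_ips = set()

    # In the page template, hidden from humans and disallowed for good bots:
    #   <a href="/do-not-follow" style="display:none">catalogue</a>
    #   (robots.txt:  Disallow: /do-not-follow)

    @app.route("/do-not-follow")
    def honeypot():
        flagged_ips.add(request.remote_addr)   # only scrapers end up here
        return "This area is off limits.", 403

    @app.before_request
    def refuse_flagged_clients():
        if request.remote_addr in flagged_ips:
            abort(403)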

This really applies to things like Safari Bookshelf, where a user copy-pasting a piece of code or a chapter to mail a colleague is fine, while a full download of the book is not acceptable. I'm quite sure that they detect when someone tries to download their books, block the account, and show the culprit that he might get in REAL trouble should he try that again.

To make a non-IT analogy: if airport security only made it hard to bring weapons on board planes, terrorists would try many ways to sneak one past security. But the fact that just trying will get you in deep trouble makes it so that nobody is going to try to find ways to sneak one through. The risk of getting caught and punished is too high. Just do the same, if possible.

Eric Darchis
+4  A: 

I've written a blog post about this here: http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/

To paraphrase:

If you post information on the internet, someone can get it; it's just a matter of how many resources they want to invest. Some means of raising the required resources are:

Turing tests

The most common implementation of the Turing Test is the old CAPTCHA that tries to ensure a human reads the text in an image and feeds it into a form.

We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing Tests that we would opt not to deal with given the choice, but sophisticated OCR can sometimes overcome those, and many bulletin-board spammers have clever tricks to get past them.

Data as images

Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing Test, there is OCR software that can read it, and there's no reason a scraper can't save the image and have someone read it later.

Oftentimes, however, listing data as an image without a text alternative is in violation of the Americans with Disabilities Act (ADA), and can be overcome with a couple of phone calls to a company's legal department.

Code obfuscation

Using something like a JavaScript function to show data on the page, even though it's not anywhere in the HTML source, is a good trick. Other examples include putting prolific, extraneous comments throughout the page, or having an interactive page that orders things in an unpredictable way (the example I think of used CSS to make the display the same no matter the arrangement of the code).

CSS Sprites

Recently we've encountered some instances where a page has one master image containing numbers and letters, and uses CSS to display only the characters desired. This is in effect a combination of the previous two methods. First we have to get that master image and read what characters are there, then we'd need to read the CSS on the site and determine which character each tag was pointing to.

While this is very clever, I suspect this too would run afoul of the ADA, though I've not tested that yet.

Limit search results

Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or a percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that submits the letters of the alphabet to the form, but if that's too general, we must make a loop to submit every combination of 2 or 3 letters; at 3 letters that's 17,576 page requests.
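
For scale, the brute-force loop on the scraper's side looks roughly like this (26^3 = 17,576 queries; the URL and parameter name are illustrative):

    import itertools
    import string

    # every 3-letter query the capped search form will be fed
    queries = ("".join(combo)
               for combo in itertools.product(string.ascii_lowercase, repeat=3))
    for q in queries:
        search_url = f"https://example.com/search?q={q}"
        # ... fetch search_url, parse the capped result page, de-duplicate ...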

IP Filtering

On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that domain. There are a number of methods to pass requests through alternate domains, however, so this method isn’t generally very effective.

Site Tinkering

Scraping always keys off of certain things in the HTML. Some sites have the resources to constantly tweak their HTML so that any scrapes are constantly out of date. It therefore becomes cost-ineffective to continually update the scrape for the constantly changing conditions.

Jason Bellows
A: 

What about using the iText library to create PDFs out of your database information? As with Flash, it won't make scraping impossible, but might make it a little more difficult.

Nels

Nels Beckman
no, it will actually make it EASIER to parse. PDF files are really well supported for searching and indexing their text contents, so all I need to do is download the .pdf files and process them offline at my leisure.
fuzzy lollipop
A: 

Old question, but: adding interactivity makes screen scraping much more difficult. If the data isn't in the original response (say, you made an AJAX request to populate a div after page load), most scrapers won't see it.

For example, I use the mechanize library to do my scraping. Mechanize doesn't execute JavaScript (it isn't a modern browser); it just parses HTML, lets me follow links and extract text, etc. Whenever I run into a page that makes heavy use of JavaScript, I'm stuck: without a fully scripted browser (one that supports the full gamut of JavaScript) I can't get at the content.
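
A minimal mechanize snippet showing why (the URL is illustrative): it only sees the HTML as served, so anything injected later by JavaScript never appears in the response it reads.

    import mechanize

    browser = mechanize.Browser()
    browser.set_handle_robots(False)       # mechanize obeys robots.txt by default
    browser.open("https://example.com/")
    html = browser.response().read()       # the raw HTML only -- no JS executed
    print(b"ajax-loaded content" in html)  # False for anything injected client-side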

This is the same issue that makes automated testing of highly interactive web applications so difficult.

Matt Luongo
A: 

Speaking as a professional developer: the techniques above for trying to protect websites (using JavaScript, AJAX, Flash, CAPTCHAs, changing the HTML code often, converting to PDFs) can all be overcome with good knowledge, sophisticated parsing, and regexes. Even if IPs are tracked and recorded, proxy servers can be deployed, i.e. you can easily use a different proxy for each individual page. I guess this is the holy grail to put some effort and thought into. I am currently working on a solution to this issue; if you would like to chat further, let me know.

Brett W
A: 

I never thought that preventing print screen would be possible... well, what do you know: check out the new tech at sivizion.com. With their video-buffer technology there is no way to take a print screen. Cool, really cool, though hard to use... I think they license the tech as well; check it out. (If I am wrong, please post here how it can be hacked.) Found it here: http://stackoverflow.com/questions/448106/how-do-i-prevent-print-screen

Tom