I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from being able to make a copy of the data in the database, yet at the same time wants everything publicly available, even including a "view all" link to display every record in the db.

Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.

Does anyone have any good tactics for preventing, or even just deterring, this that they could share?

Thanks

+2  A: 

I don't know why you'd deter this. The customer's offering the data.

Presumably they create value in some unique way that's not trivially reflected in the data.

Anyway.

You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.

Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.

S.Lott
+21  A: 

While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:

  • Rate limit by user account, IP address, user agent, etc. - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a rough sketch follows after this list).

  • Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a bare-bones spider.

  • RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.

  • Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text. This would defeat most text parsers, but isn't foolproof, of course.

  • robots.txt - to turn away obvious web spiders and known robot user agents:

    User-agent: *

    Disallow: /

  • Use robots meta tags to stop conforming spiders. For instance, this will prevent Google from indexing you:

    <meta name="robots" content="noindex,follow,noarchive">

There are different levels of deterrence and the first option is probably the least intrusive.
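
As an illustration of the rate-limiting bullet above, here is a minimal sketch. It assumes the APCu extension is available for shared counters; the limit, window and key prefix are arbitrary placeholders rather than recommended values, and behind a proxy REMOTE_ADDR may need to be swapped for a forwarded-for header.

    <?php
    // Minimal per-IP rate limiting sketch (assumes the APCu extension is
    // enabled; the limit and window below are illustrative only).

    const MAX_REQUESTS   = 100; // requests allowed per window
    const WINDOW_SECONDS = 60;  // length of the window

    $ip  = $_SERVER['REMOTE_ADDR'];
    $key = 'rate:' . $ip;

    // apcu_add() only succeeds if the key does not exist yet, so the counter
    // (and its TTL) is created once per window, then incremented on each hit.
    apcu_add($key, 0, WINDOW_SECONDS);
    $count = apcu_inc($key);

    if ($count > MAX_REQUESTS) {
        header('HTTP/1.1 429 Too Many Requests');
        header('Retry-After: ' . WINDOW_SECONDS);
        exit('Too many requests - please slow down.');
    }

    // ...normal page rendering continues here...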

jspcal
+1 for being the only answer with some actual helpful techniques.
zombat
+23  A: 

If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.

You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.

Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.

Welbog
Yeah, this was my take on it as well. But apparently the client has had another 'expert' look at the site, who has now got them in a bit of a panic about this. It seems to me like any attempt to make it more difficult would either be trivial to overcome (e.g. requiring login, burying data in excessive or weird HTML tags) or have a severe accessibility/SEO impact (e.g. mashing up data on the PHP side and 'unmashing' it again with JavaScript, checking if the visitor is using a 'proper' web browser, etc.). Thanks everyone for your help anyway.
Addsy
+4  A: 

There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.

Rule of thumb: If you want to keep something to yourself, keep it off the Internet.

Geoffrey Chetwood
A: 

My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.

People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.

So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.

Chuck Vose
Why do you think screen-scraping is illegal?
DaveE
+1  A: 

There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique, adding or changing the HTML without affecting the layout. This would possibly make it harder for someone to harvest the data using regular expressions, but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
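
For example, here is a minimal sketch of that idea, assuming the values are printed from a PHP result loop; the helper name and the amount of "noise" are made up for illustration:

    <?php
    // Wrap every value in a span with a random, per-request class name and
    // sprinkle in empty tags that don't affect layout, so a fixed regular
    // expression written against one page won't match the next.
    function obfuscateValue(string $value): string
    {
        $class = 'c' . bin2hex(random_bytes(4));                 // random class name
        $noise = str_repeat('<span></span>', random_int(0, 3));  // harmless empty tags

        return sprintf('<span class="%s">%s%s</span>',
            $class, $noise, htmlspecialchars($value));
    }

    // e.g. inside the loop that renders the table ($row comes from your query):
    echo '<td>' . obfuscateValue($row['phone']) . '</td>';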

I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.

Andy E
+9  A: 

Try using Flash or Silverlight for your frontend.

While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.

1kevgriff
Flash apps are even easier to 'scrape' (i.e., intercept and re-interpret data) than HTML sites for someone who knows how, if they send info to the page as AMF objects.
Alex JL
+9  A: 

There are a few ways you can do it, although none are ideal.

  1. Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP (a sketch follows after this list). Alternatively, you could do this just for requests over a certain size (i.e. all).

  2. Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (a sketch of the token handshake also follows after this list).
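
Here is a rough sketch of option 1, assuming the GD extension; the built-in font, the padding and the hypothetical lookup function are placeholders, not part of the answer:

    <?php
    // render_value.php - serve a single field as a PNG instead of text.
    $text = getValueFromDatabase($_GET['id']);   // hypothetical lookup, not shown here

    $font   = 5;                                 // built-in GD font
    $width  = imagefontwidth($font) * strlen($text) + 10;
    $height = imagefontheight($font) + 10;

    $im    = imagecreatetruecolor($width, $height);
    $white = imagecolorallocate($im, 255, 255, 255);
    $black = imagecolorallocate($im, 0, 0, 0);

    imagefilledrectangle($im, 0, 0, $width, $height, $white);
    imagestring($im, $font, 5, 5, $text, $black);

    header('Content-Type: image/png');
    imagepng($im);
    imagedestroy($im);

The page would then embed <img src="render_value.php?id=123"> where the text used to be.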
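
And a sketch of option 2's token handshake, assuming PHP sessions; the 10-second lifetime follows the answer, while the function names are invented for this example:

    <?php
    session_start();

    // Called while rendering the page shell; the returned token is embedded in
    // the page so the JavaScript can send it back with the AJAX request.
    function issueAjaxToken(int $lifetime = 10): string
    {
        $_SESSION['ajax_token']   = bin2hex(random_bytes(16));
        $_SESSION['ajax_expires'] = time() + $lifetime;
        return $_SESSION['ajax_token'];
    }

    // Called at the top of the AJAX data endpoint, before any records go out.
    function requireValidAjaxToken(): void
    {
        $sent  = $_GET['token'] ?? '';
        $valid = isset($_SESSION['ajax_token'], $_SESSION['ajax_expires'])
              && hash_equals($_SESSION['ajax_token'], $sent)
              && time() <= $_SESSION['ajax_expires'];

        if (!$valid) {
            header('HTTP/1.1 403 Forbidden');
            exit('Invalid or expired token.');
        }
    }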

Brent Baisley
+2  A: 

Using something like Adobe Flex - a Flash application front end - would fix this.

Other than that, if you want it to be easy for users to access, it's easy for users to copy.

Dean J
A: 

What about creating something akin to a bulletin board's troll protection... If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you could then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields (sketched below).

Turn this off for Google IPs!
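
A minimal sketch of the poisoning step, assuming a separate detection flag such as the rate-limit counter sketched earlier; the function name is made up for this example:

    <?php
    // Subtly corrupt a phone number once the request has been flagged as a
    // probable scraper ($isProbablyScraper comes from your own detection).
    function poisonPhone(string $phone): string
    {
        if ($phone === '') {
            return $phone;
        }
        $pos = random_int(0, strlen($phone) - 1);
        if (ctype_digit($phone[$pos])) {
            $phone[$pos] = (string) random_int(0, 9); // swap one digit at random
        }
        return $phone;
    }

    $shownPhone = $isProbablyScraper ? poisonPhone($row['phone']) : $row['phone'];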

Shawn Leslie
+1  A: 

Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.

He's asking for two incongruent things, and having a discussion about his reasoning may yield some fruit.

It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.

Even Mien
A: 

Normally, to screen-scrape a decent amount, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:

How do you stop scripters from slamming your website hundreds of times a second?

Alix Axel
A: 

Force a reCAPTCHA every 10 page loads for each unique IP.
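
A rough sketch of the counting side, using a per-session counter as a simple stand-in (the answer says per unique IP, which would need a shared store such as the APCu counter sketched earlier); showCaptchaChallenge() is a placeholder for whatever CAPTCHA integration you actually use, not a real API:

    <?php
    session_start();

    // Count page loads for this visitor and demand a CAPTCHA on every tenth one.
    $_SESSION['page_loads'] = ($_SESSION['page_loads'] ?? 0) + 1;

    if ($_SESSION['page_loads'] % 10 === 0) {
        showCaptchaChallenge();   // placeholder: render the challenge, then stop
        exit;
    }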

A: 

A nice article about it:

http://perishablepress.com/press/2010/09/24/content-scrapers-suck-ass/

Marko