views:

1422

answers:

16

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?

+10  A: 

Sorry, it's really quite hard to do this...

I would suggest that you politely ask them not to use your content (if your content is copyrighted).

If it is and they don't take it down, then you can take further action and send them a cease-and-desist letter.

Generally, whatever you do to prevent scraping will probably end up having a more negative effect, e.g. on accessibility, legitimate bots/spiders, etc.

Lizard
+5  A: 

Your best option is unfortunately fairly manual: look for traffic patterns that you believe are indicative of scraping and ban their IPs.

Since you're talking about a public site, making it search-engine friendly will also make it scraping-friendly; if a search engine can crawl and scrape your site, a malicious scraper can as well. It's a fine line to walk.

STW
+22  A: 

There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IPs, etc., and appear as a normal user. The only thing you can do is make the text unavailable at the time the page is loaded - serve it as an image, in Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one is an accessibility issue if JavaScript is not enabled for some of your regular users.

If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
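
For example, a minimal sketch of per-IP rate limiting (the 60-second window, the threshold, and the Flask framework are my assumptions, not something from this answer; a real deployment would keep the counters in a shared store such as Redis rather than in process memory):

    import time
    from collections import defaultdict, deque

    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60        # sliding window length (assumed)
    MAX_REQUESTS = 120         # pages allowed per IP per window (assumed)
    hits = defaultdict(deque)  # ip -> timestamps of recent requests

    @app.before_request
    def rate_limit():
        now = time.time()
        recent = hits[request.remote_addr]
        # forget requests that fell out of the window
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        recent.append(now)
        if len(recent) > MAX_REQUESTS:
            abort(429)  # Too Many Requests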

There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
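
A rough sketch of the randomization idea, assuming a Flask app (all names here are hypothetical): generate meaningless class names on every request so a scraper can't hard-code its selectors.

    import secrets

    from flask import Flask, render_template_string

    app = Flask(__name__)

    # class names are placeholders filled in per request
    TEMPLATE = """
    <div class="{{ cls['artist'] }}">{{ artist }}</div>
    <div class="{{ cls['album'] }}">{{ album }}</div>
    """

    @app.route("/artist")
    def artist_page():
        # fresh, random class names on every page load
        cls = {name: "c" + secrets.token_hex(4) for name in ("artist", "album")}
        return render_template_string(TEMPLATE, cls=cls,
                                      artist="Dummy Artist", album="Dummy Album")

Note that your own CSS can no longer target those class names either (you would have to style on structure, or generate the stylesheet per request as well), which is part of why this is a lot of work.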

ryeguy
Creating a system that limits how many pages an IP can view per minute is a good hack, as screen scrapers will rip through the site much faster than any normal person.
TravisO
Agreed. IPs cost money and are limited by most hosting companies.
Tal Weiss
re:"Things like changing the ID or class names of page elements on each load, etc." That's not so bad if you create a class name via PHP and then just use <div class="<?php echo $myPHPGeneratedClassName; ?>"> you could even use random strings in it to make it completely unique. Anything that stops them finding patterns makes it a lot harder to parse out of the DOM.
niggles
It's not hard to find another IP to use. There are plenty of proxies, friends' computers, work computers, school computers, library computers...
@user257493: True, but we're talking about someone who's scraping data here. Chances are they aren't going to go to *that* much effort just to harvest data. And if they do, you'll eventually deplete their supply of IPs.
ryeguy
A: 

Sure it's possible. For 100% success, take your site offline.

In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).

You can do things like require several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
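
One hedged reading of that idea in code (the two-second threshold, per-IP tracking, and Flask are all assumptions):

    import time

    from flask import Flask, abort, request

    app = Flask(__name__)

    MIN_DELAY = 2.0   # seconds required between the first connection and the next click (assumed)
    first_seen = {}   # ip -> timestamp of that IP's first request

    @app.before_request
    def enforce_delay():
        now = time.time()
        ip = request.remote_addr
        if ip not in first_seen:
            first_seen[ip] = now
        elif now - first_seen[ip] < MIN_DELAY:
            abort(429)  # came back too quickly after the first connection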

I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.

Wayne Werner
A: 

Putting your content behind a CAPTCHA would mean that robots would find it difficult to access your content. However, humans would be inconvenienced, so that may be undesirable.

SorcyCat
+1  A: 

I agree with most of the posts above, and I'd like to add that the more search-engine friendly your site is, the more scrapeable it will be. You could try to do a couple of things that are very out there and make it harder for scrapers, but they might also affect your search-ability... it depends on how well you want your site to rank on search engines, of course.

sjobe
+3  A: 

Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it; you might as well go all out.

This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.

Then all you have to do is convince the people who want your data to use the API. ;)
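
As a starting point, a minimal sketch of one such endpoint (the URL shape, the fields, and Flask are assumptions; the data is a stand-in for the real artist database):

    from xml.etree.ElementTree import Element, SubElement, tostring

    from flask import Flask, Response

    app = Flask(__name__)

    # hypothetical stand-in for the real artist database
    ARTISTS = {1: {"name": "Dummy Artist", "albums": ["First Album", "Second Album"]}}

    @app.route("/api/artist/<int:artist_id>.xml")
    def artist_xml(artist_id):
        artist = ARTISTS.get(artist_id)
        if artist is None:
            return Response("<error>not found</error>", status=404,
                            mimetype="application/xml")
        root = Element("artist", id=str(artist_id))
        SubElement(root, "name").text = artist["name"]
        albums = SubElement(root, "albums")
        for title in artist["albums"]:
            SubElement(albums, "album").text = title
        return Response(tostring(root, encoding="unicode"),
                        mimetype="application/xml")

An API like this is also the natural place to enforce keys, rate limits, and attribution requirements, which is much easier than policing scraped HTML.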

Williham Totland
This seems very reasonable. Screen scraping is damn hard to prevent, and if you provide an API, you can put some restrictions on it, add notices ("Content from ----.com"), and basically control what data is given.
alecwh
+57  A: 

I will presume that you have set up robots.txt.

As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.

What I would consider doing is:

  1. Set up a page /jail.html
  2. Disallow access to the page in robots.txt (so the respectful spiders will never visit)
  3. Place a link on one of your pages, hiding it with CSS (display: none).
  4. Record IPs of visitors to /jail.html

This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.

You might also want to make your /jail.html an entire website that has the same exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
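
A minimal sketch of steps 2 and 4 (Flask, the log file, and the fake markup are my assumptions; step 3 is simply a display: none link to /jail.html in your normal page templates):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/robots.txt")
    def robots():
        # step 2: well-behaved spiders are told to stay away
        return "User-agent: *\nDisallow: /jail.html\n", 200, {"Content-Type": "text/plain"}

    @app.route("/jail.html")
    def jail():
        # step 4: whoever lands here ignored robots.txt and followed a hidden link
        with open("scraper_ips.log", "a") as log:
            log.write(f"{request.remote_addr}\t{request.headers.get('User-Agent', '')}\n")
        # looks like a normal page, but everything on it is fake data
        return "<html><body><h1>Dummy Artist</h1></body></html>"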

Daniel Trebbien
I've seen this technique referred to as a "honeypot" before. It's a technique also used in spam filtering, where you put an email address on a page but hide it or make it clear it isn't for people to send legitimate mail to. Then collect the IP address of any mail server that delivers mail to that address.
thomasrutter
The robots.txt -> fake data is actually quite brilliant assuming it works: it goes with the nature of the attack and overcomplies. You could do all kinds of things that are related. For instance, exclude something in robots.txt and link to it with an invisibly-colored link that humans will not see. Now the only agents that will get there are your scrapers.
Yar
This assumes they are crawling links. Most scrapers will try to submit to a form of some kind and scrape the data returned.
Byron Whitlock
I've seen Perl based honeypots for email that have links to other "pages" that are generated by the Perl script. Legitimate bots that read robots.txt don't look at it, and it's hidden from users via CSS, but scrapers (or email harvesters) quickly get caught in an infinite-depth tree of pages, all with bad data on them. Put a link to the script right at the start of each of your pages.
Stephen P
So what about a legitimate user who has CSS disabled and clicks on your supposedly invisible honeypot link?
Lotus Notes
Another awesome thing to toss in for honeypots is teergrubing (or tarpitting). This is an old technique that I love - when you identify a bad guy, you bring his spamming/scraping process to a crawl by purposefully keeping his connections open for as long as physically possible without timing them out. Of course, this may alert them that you're on to them as well, but gosh darn it's fun. http://en.wikipedia.org/wiki/Teergrubing
womp
The only problem with this approach is if I place [img] http://yoursite/jail.html [/img] on a popular forum. You will receive tons of IPs logged in your system, and it will be hard to filter out which ones are the bad ones. If you want to prevent this kind of thing, you need to add a token associated with the IP in the URL - something like jail.php?t=hoeyvm, where the database stores an association between hoeyvm and the IP that requested the page.
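
A sketch of that token idea; note this version uses an HMAC of the IP instead of a database table, which gives the same IP-to-token association without storage (the secret and the parameter name are assumptions):

    import hashlib
    import hmac

    SECRET = b"server-side secret"  # assumed: never exposed in page source

    def jail_token(ip: str) -> str:
        # token embedded in the hidden link, e.g. /jail.html?t=<token>
        return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]

    def is_scraper_hit(ip: str, token: str) -> bool:
        # only log the IP when the token was actually issued to it; third parties
        # hot-linking the honeypot URL from a forum won't match and are ignored
        return hmac.compare_digest(jail_token(ip), token)
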
HoLyVieR
@HoLyVieR: That's a very good idea
Daniel Trebbien
I've never come across a database where I have been unable to scrape its output pages with a simple script... The last one I did, the guy tried to obfuscate the markup, the method for "next page" was awkward, it stored session variables and cookies, it checked for time patterns... I did the entire site in maybe... 10 lines of Perl or less, straight into a CSV. I could just as easily have gone back in a day from another IP if he had blocked me. But let's be real here: if you put information online, it's out there; your only real options are legal ones.
+6  A: 

Okay, as all the posts say, if you want to make your site search-engine friendly then bots can scrape it for sure.

But there are still a few things you can do, and they may be effective for 60-70% of scraping bots.

Make a checker script like the one sketched below.

If a particular IP is visiting very fast, then after a few visits (5-10) put its IP + browser info in a file or database.

Next step (this would be a background process, running all the time or scheduled every few minutes): make another script that keeps checking those suspicious IPs.

Case 1. If the user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more info on user agents by googling them), then check it against the list at http://www.iplists.com/ and try to match patterns. If it seems to be a faked user agent, ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IPs a bit more. I know this is achievable; doing a whois on the IP can also be helpful.)

Case 2. If the user agent is not that of a search bot, simply ask the visitor to fill in a CAPTCHA on the next visit.
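
A rough sketch of the two cases (the bot list, the thresholds, and the reverse-plus-forward DNS check are my assumptions; iplists.com or the search engines' own documentation are the authoritative sources):

    import socket

    # user agents and the domains their IPs should reverse-resolve to (assumed subset)
    KNOWN_BOTS = {
        "googlebot": (".googlebot.com", ".google.com"),
        "bingbot": (".search.msn.com",),
    }

    def verdict(ip: str, user_agent: str) -> str:
        """Decide what to do with an IP that the fast-visit logger flagged as suspicious."""
        ua = user_agent.lower()
        for bot, domains in KNOWN_BOTS.items():
            if bot in ua:
                # Case 1: claims to be a search engine - verify with reverse + forward DNS
                try:
                    host = socket.gethostbyaddr(ip)[0]
                    if host.endswith(domains) and socket.gethostbyname(host) == ip:
                        return "allow"      # genuine crawler
                except OSError:
                    pass
                return "captcha"            # looks like a faked search-engine user agent
        # Case 2: does not claim to be a search bot at all
        return "captcha"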

Hope the above helps.

Sorry for the bad English. :)

You can contact me if you need further help making this stuff. (It's kinda interesting stuff.)

Arsheep
+1 Using iplists.com is an excellent idea
Daniel Trebbien
A: 

Screen scrapers work by processing HTML, and if they are determined to get your data there is not much you can do technically, because the human eyeball can process anything. Legally, as has already been pointed out, you may have some recourse, and that would be my recommendation.

However, you can hide the critical parts of your data by using non-HTML-based presentation logic:

  • Generate a Flash file for each artist/album, etc.
  • Generate an image for each artist's content. Maybe just an image of the artist name would be enough. Do this by rendering the text onto a JPEG/PNG on the server and linking to that image (see the sketch below).
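
A minimal sketch of that second option using Pillow (the sizing and the default font are assumptions; a real implementation would load a proper TrueType font and cache the generated images):

    from PIL import Image, ImageDraw, ImageFont  # Pillow

    def artist_name_image(name: str, path: str) -> None:
        """Render an artist name to a PNG so the text never appears in the HTML."""
        font = ImageFont.load_default()           # assumed; use ImageFont.truetype() for a real font
        width, height = 10 * len(name) + 20, 30   # rough sizing for the default font
        img = Image.new("RGB", (width, height), "white")
        ImageDraw.Draw(img).text((10, 8), name, fill="black", font=font)
        img.save(path)

    # the page then links to the image instead of printing the name:
    #   <img src="/images/artists/123.png" alt="">
    artist_name_image("Dummy Artist", "dummy_artist.png")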

Bear in mind that this would probably affect your search rankings.

James Westgate
+14  A: 

Sue `em.

Seriously: if you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You may really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease and desist or its equivalent in your country. You may be able to at least scare the bastards.

Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.

It would be a shame if this drove you into messing up your HTML code, dragging down SEO, validity, and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that rely on HTML structures and class/ID names to get the content out).

Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money is something that you should be able to fight against.

Unicron
+1. Sometimes the best solution isn't more code.
Roger Pate
+1  A: 
  1. No, it's not possible to stop (in any way).
  2. Embrace it. Why not publish as RDFa, become super search-engine friendly, and encourage the re-use of data? People will thank you and provide credit where due (see MusicBrainz as an example).

Probably not the answer you want, but why hide what you're trying to make public?

nathan
A: 

There are a few things you can do to try to prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but they hinder usability. You have to keep in mind too that they may hinder legitimate site scrapers, such as search engine indexers.

However, I assume that if you don't want it scraped that means you don't want search engines to index it either.

Here are some things you can try:

  • Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
  • Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
  • Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen-scraper might set up an account and might cleverly program their script to log in for them.
  • If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
  • You can set up a blacklist of known screen-scraper user-agent strings as you discover them. Again, this will only help against the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser. (A sketch of both user-agent checks follows this list.)
  • Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
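
A minimal sketch of the two user-agent checks above (the blacklist entries and the Flask framework are assumptions):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # known scraper user-agent fragments collected over time (hypothetical entries)
    UA_BLACKLIST = {"python-urllib", "curl", "wget", "scrapy"}

    @app.before_request
    def filter_user_agents():
        ua = request.headers.get("User-Agent", "").strip().lower()
        if not ua:
            abort(403)  # lazily written scripts often send no user agent at all
        if any(bad in ua for bad in UA_BLACKLIST):
            abort(403)  # matches a known scraper on the blacklist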

If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way, and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them, I guess.

thomasrutter
A: 

You can't stop normal screen scraping. For better or worse, it's the nature of the web.

You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS as well.

Dinah
A: 

Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.

Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
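
A minimal sketch of the whitelist idea (the user-agent substrings are assumed, and looks_like_a_bot stands in for whatever traffic-pattern detection from the other answers you end up using):

    # crawler user-agent fragments for the engines you actually want (assumed list)
    CRAWLER_WHITELIST = ("googlebot", "bingbot", "slurp", "duckduckbot")

    def allow_request(user_agent: str, looks_like_a_bot: bool) -> bool:
        """Whitelist approach: anything behaving like a bot must carry a known crawler UA."""
        if not looks_like_a_bot:
            return True   # ordinary visitors are untouched
        ua = user_agent.lower()
        return any(name in ua for name in CRAWLER_WHITELIST)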

Chris
A: 

Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You then can't use a cache or static content any more, though.

Stephan Eggermont