views: 467
answers: 10

Pretty sure this question counts as blasphemy to most web 2.0 proponents, but I do think there are times when you might not want pieces of your site being easily ripped off into someone else's arbitrary web aggregator. At least enough that they'd have to be arsed to do it by hand if they really wanted it.

My idea was to make a script that positioned text nodes by absolute coordinates in the order they'd appear normally within their respective paragraphs, but then stored those text nodes in a random, jumbled up order in the DOM. Of course, getting a system like that to work properly (proper text wrap, alignment, styling, etc.) seems almost akin to writing my own document renderer from scratch.
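Roughly this kind of thing, just to illustrate (positions are hard-coded guesses here; a real script would have to measure the text and handle wrapping itself):

    <!-- Illustration only: the spans sit in the DOM in a shuffled order but are
         positioned back into reading order with absolute coordinates. -->
    <div style="position: relative; font-family: monospace; height: 3em;">
      <span style="position: absolute; left: 20ch; top: 0;">jumps over</span>
      <span style="position: absolute; left: 0; top: 1.2em;">the lazy dog.</span>
      <span style="position: absolute; left: 0; top: 0;">The quick brown fox</span>
    </div>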

I was also thinking of combining that with a CAPTCHA-like thing to muss up the text in subtle ways so as to hinder screen scrapers that could simply look at snapshots and discern letters or whatnot. But that's probably overthinking it.

Hmm. Has anyone yet devised any good methods for doing something like this?

A: 

Just load all your HTML via AJAX calls and the content will not appear in the raw page source that most screen scrapers fetch.
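A rough sketch of what that looks like (the endpoint URL and element id are made up; fetch is used for brevity, and XMLHttpRequest works the same way):

    // Pull the markup in after the page loads, so it isn't in the raw HTML a
    // naive scraper downloads.
    document.addEventListener('DOMContentLoaded', function () {
      fetch('/fragments/article-body')              // made-up endpoint
        .then(function (response) { return response.text(); })
        .then(function (html) {
          document.getElementById('article-body').innerHTML = html;
        });
    });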

mmattax
Not really a solution IMO. Someone trying to scrape the content from your site doesn't care where the data comes from; AJAX will simply make it easier to get the data, because now it will be well-formed / structured. Don't forget, any URL the browser can call, the scraper can call as well. This applies to RESTful services and other similar AJAX APIs.
raiglstorfer
+3  A: 

Your ideas would probably break any screen-readers as well, so you should check accessibility requirements/legislation before messing up ordering.

Douglas Leeder
Well I'm referring only to specific portions of a web page. I may wish to keep the rest untouched so that wouldn't be a problem. Maybe also apply tags to the affected sections so search engines will be able to figure out the gist of what it is...
Daddy Warbox
+6  A: 

Consider that everything the scraper can't read, search engines can't read either. That being said, you could inject content into your document via JavaScript after the page has loaded.

Eran Galperin
As someone that has scraped sites before, I can tell you that scraping a Javascript file is often easier than scraping an HTML page. I would not suggest this as a method for hiding data.
raiglstorfer
I was of course referring to fetching the data by Ajax. How do you scrape text that does not exist in the document?
Eran Galperin
Data fetched by AJAX is even easier to scrape ... a simple proxy server will tell you what services are being called via AJAX to render the page ... at which point you can just skip scraping the page and consume the original data feed.
raiglstorfer
A "simple" proxy server is not really a part of your basic scraping tools. Regardless, you can use session tokens to make sure the original page is accessed first, and also limit requests based on those tokens. In the end, anything available to a user can be read by an automatic scraper if it's configured properly. The point is just to make this as hard as possible.
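Something along these lines, as a rough sketch in plain Node.js (routes and header names are invented, and this is nowhere near hardened):

    // The page view hands out a short-lived token; the data endpoint refuses
    // requests that don't present one. Illustrative only.
    const http = require('http');
    const crypto = require('crypto');

    const tokens = new Map();                        // token -> expiry time

    http.createServer(function (req, res) {
      if (req.url === '/article') {
        const token = crypto.randomBytes(16).toString('hex');
        tokens.set(token, Date.now() + 60 * 1000);   // valid for one minute
        res.end('<div id="article-body" data-token="' + token + '"></div>');
      } else if (req.url.startsWith('/fragments/')) {
        const expiry = tokens.get(req.headers['x-page-token']);
        if (!expiry || expiry < Date.now()) {
          res.statusCode = 403;
          return res.end();
        }
        tokens.delete(req.headers['x-page-token']);  // one use per page view
        res.end('<p>the actual article text</p>');
      } else {
        res.statusCode = 404;
        res.end();
      }
    }).listen(3000);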
Eran Galperin
+3  A: 

Please don't use absolute positioning to reassemble a scrambled page. This won't work for mobile devices, screen readers for the visually impaired, and search engines.

Please don't add captcha. It will just drive people away before they ever see your site.

Any solution you come up with will be anti-web. The Internet is about sharing, and you have to take the bad with the good.

If you must do something, you might want to just use Flash. I haven't seen link farmers grabbing Flash content, yet. But for all the reasons stated in the first paragraph, Flash is anti-web.

Michael L Perry
I warned you it might be blasphemy. ;) Flash isn't exactly immune to scraping anymore, from what I've heard.
Daddy Warbox
I agree, the only way to totally block scraping is to require captchas ... and that will kill the usability of your site. Perhaps if used sparingly or randomly within a session?
raiglstorfer
A: 

Render all your text in SVG using something like ImageMagick
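For what it's worth, inline SVG text looks like the snippet below; note the characters are still sitting right there in the markup, so you'd have to convert the glyphs to paths (which a rendering tool can do) before it actually hides anything from a scraper.

    <svg xmlns="http://www.w3.org/2000/svg" width="420" height="30">
      <text x="0" y="20" font-family="sans-serif" font-size="16">
        This sentence is drawn as SVG text rather than HTML.
      </text>
    </svg>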

Mike Deck
SVG is only native to Firefox at the moment, isn't it? That's not a bad idea otherwise, though...
Daddy Warbox
It doesn't need to be SVG either. If you're happy to abandon all attempts at being friendly to screen readers, search engines, disabled users, etc., then why not just create a great big .gif for each page?
Colin Pickard
+3  A: 

I've seen a TV guide site decrypt its listings using JavaScript on the client side. It wouldn't stop a determined scraper, but it would stop most casual scripting.

All the textual TV entries look like ps10825('4VUknMERbnt0OAP3klgpmjs....abd26'), where ps10825 is simply a function that calls their decrypt function with a key of ps10825. Obviously the key is generated each time.

In this case I think it's quite adequate to stop 99% of people using Greasemonkey or even wget scripts to download their TV guide without seeing all of their adverts.
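A toy version of that pattern might look like the following (the "decryption" here is just base64 as a stand-in, and the wrapper name would be generated per page; this is obfuscation, not security):

    // Server emits a per-page wrapper plus calls like ps10825('<payload>').
    function decrypt(payload, key) {
      var text = atob(payload);                  // stand-in for a real scheme
      return text.indexOf(key) === 0 ? text.slice(key.length) : text;
    }

    function ps10825(payload) {                  // name generated per page
      document.write(decrypt(payload, 'ps10825'));
    }

    // The listings in the page body would then be emitted as:
    // ps10825('...base64 payload produced by the server...');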

Mark Nold
That's a good idea.
Daddy Warbox
How valuable is the information you're trying to hide? If you don't change your algorithm frequently, it's trivial to replicate its functionality. If you do, then I can always hook my script into a local JS execution engine (e.g. SpiderMonkey). One way or another, it's quite simple to break this protection in at most a full day.
viraptor
+4  A: 
//<trying_to_be type="funny">

I strongly suggest using the marquee tag:

<marquee behavior="alternate">This text will bounce from left to right</marquee>

If they see it on your page, they will surely not aggregate your site.

//</trying_to_be>
Andre Bossard
I'd love to upvote you for hilarity, but I'll just leave you with a "well played" comment instead. :)
kooshmoose
A: 

Alexa.com does some wacky stuff to prevent scraping. Go here and look at the traffic rank number http://www.alexa.com/data/details/traffic_details/teenormous.com

+1  A: 

To understand this, it is best to attempt to scrape a few sites yourself. I have scraped some pretty challenging sites, like banking sites. I've seen many attempts at making scraping difficult (e.g. encryption, cookies, etc.). At the end of the day, the best defense is unpredictable markup. Scrapers rely most heavily on being able to find "patterns" in the markup. The moment the pattern changes, the scraping logic fails. Scrapers are notoriously brittle and break down easily.

My suggestion: randomly inject non-visible markup into your code, in particular around content that is likely to be interesting. Do anything you can think of to make your markup look different to a scraper on each request.
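As a sketch (server-side JavaScript, with invented helper names), the idea is to wrap the interesting content in randomly named containers and sprinkle in hidden decoys so the structure changes on every request:

    // Wrap real content in randomly named containers and add hidden decoy
    // spans, so the markup around the interesting text differs per request.
    const crypto = require('crypto');

    function randomName() {
      return 'x' + crypto.randomBytes(4).toString('hex');
    }

    function obfuscateFragment(html) {
      const parts = ['<div class="' + randomName() + '">'];
      const decoys = 1 + crypto.randomInt(4);        // 1-4 hidden decoys
      for (let i = 0; i < decoys; i++) {
        parts.push('<span class="' + randomName() + '" style="display:none">' +
                   crypto.randomBytes(6).toString('hex') + '</span>');
      }
      parts.push(html, '</div>');
      return parts.join('\n');
    }

    // e.g. obfuscateFragment('<p>the paragraph you care about</p>')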

raiglstorfer
A: 

Few of these techniques will stop the determined. Alexa-style garbage-HTML/CSS masking is easy to get around (just parse the CSS); AJAX/JavaScript DOM insertion is easy to get around as well, although form authenticity tokens make this harder.

I've found providing an official API to be the best deterrent :)

Barring that, rendering text into an image is a good way to stop the casual scraper (though even that is still doable).

YouTube also uses JavaScript obfuscation that makes AJAX reverse engineering more difficult.

jamiew