I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a CAPTCHA, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be vs. text?

+1  A: 

Generate an image containing those numbers and display the image. :-)

Ron

Ron Savage
+1  A: 

I can't believe I'm promoting a common malware scripting tactic, but...

You could emit the numbers as obfuscated JavaScript that decodes and renders them at runtime.
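A minimal server-side sketch of that idea, in Python. Everything here is an assumption for illustration: the function names, the single-byte XOR scheme, and the `element_id` parameter are hypothetical, and the XOR is only obfuscation, not encryption.

```python
import json
import random


def encode(number: str, key: int) -> list:
    """XOR each character code with the key so the digits never appear literally."""
    return [ord(ch) ^ key for ch in number]


def decode(codes: list, key: int) -> str:
    """What a scraper would do once it has read the key from the page."""
    return "".join(chr(c ^ key) for c in codes)


def obfuscated_script(number: str, element_id: str, key=None) -> str:
    """Return a <script> tag that writes `number` into the element at runtime."""
    if key is None:
        key = random.randint(1, 255)  # a fresh key per response
    codes = encode(number, key)
    return (
        "<script>(function(){"
        f"var k={key},c={json.dumps(codes)};"
        "var s=c.map(function(x){return String.fromCharCode(x^k);}).join('');"
        f"document.getElementById({json.dumps(element_id)}).textContent=s;"
        "})();</script>"
    )
```

Note that `decode` is all a scraper needs, and the key ships with the page, so this only stops tools that read the raw HTML without running or inspecting the script.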

routeNpingme
+5  A: 

Apart from images, you could display the numbers using JavaScript or Flash.

You could also use CSS to position individual digits using various combinations of absolute or relative positioning, and use JavaScript to help you create these DIVs. The point is just to obfuscate enough that it becomes really hard.

One more solution is to use images of segments or single dots and re-construct the images of the digits using CSS, a bit like a dot-matrix display. You could litter the source of the page with these absolutely positioned DIVs and again make it more difficult to reconstruct by creating them dynamically.
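To make the CSS positioning idea concrete, here is a rough Python sketch that emits the digits in shuffled source order and lets absolute positioning put them back in visual order. The function names, the fixed `digit_width_px`, and the use of `<span>` elements are assumptions for the example; it also assumes the input contains only digits, so no HTML escaping is done.

```python
import random
import re


def scrambled_digits_html(number: str, digit_width_px: int = 10, seed=None) -> str:
    """Emit one absolutely positioned span per digit, in shuffled source order."""
    rng = random.Random(seed)
    # Pair each character with its visual slot, then shuffle the source order.
    slots = list(enumerate(number))
    rng.shuffle(slots)
    spans = "".join(
        f'<span style="position:absolute;left:{i * digit_width_px}px">{ch}</span>'
        for i, ch in slots
    )
    width = len(number) * digit_width_px
    return f'<div style="position:relative;width:{width}px;height:1.2em">{spans}</div>'


def recover(html: str) -> str:
    """The scraper's side: sort the spans by their left offset."""
    pairs = re.findall(r'left:(\d+)px">(.)</span>', html)
    return "".join(ch for _, ch in sorted(pairs, key=lambda p: int(p[0])))
```

The `recover` function shows how little work it takes to undo this once someone looks at the markup, which is the point made in the next paragraph.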

At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR tool. There is also nothing stopping anyone from paying someone pennies to get the data manually.

The point is: how determined are your opponents (your users?).
It's a bit like the software protection business: making things hard enough that you would deter casual 'pirates' is not too hard, and it's a fairly good approach in general.

However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough that casual 'thieves' will prefer to continue paying for your services rather than circumvent them.

Renaud Bompuis
Using JS, you could do an AJAX request once DOM is ready and load all those numbers in one batch. Then just assign them to appropriate elements. Keep in mind though, it only works if JS is off at the scraping side.
MK_Dev
A: 

Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to create an image instead of dumping out the text of a number, but how often would you be doing this per day?

Using JavaScript is the same as using text. It's trivial to reverse engineer.

Tom
+2  A: 

JavaScript would probably be the easiest to implement, but you could get really creative and have large blocks of numbers with only certain ones viewable, by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via CSS and semi-randomly generated class names.

jsoverson
+9  A: 

The only way to make sure bad-guys don't get your data is not to share it with anyone. Any other solution is essentially entering an arms race with the screen-scrapers. At one point or another, one of you will find the arms-race too costly to continue. If the data you are sharing has any perceptible value, then probably the screen-scrapers will be very determined.

TokenMacGuy
this is true, basically if you're giving it away for free there's not much you can do to stop people taking it
nailitdown
A: 

Display the numbers as an animation using Flash. It may not be foolproof, but it would make it harder to crack.

Ramesh
A: 

What about posting a lot of dummy numbers and showing the right ones with external CSS? That works just as long as the scraper doesn't start to parse the external CSS.
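A hedged sketch of how the decoy approach might be generated server-side, in Python. The names (`decoy_markup`, the `c` class prefix) and the choice of `display:none` are assumptions for illustration; the returned CSS is meant to be served as a separate stylesheet.

```python
import random
import secrets


def decoy_markup(real: str, decoys: int = 4, seed=None):
    """Return (html, css): the real number mixed with decoys hidden by CSS."""
    rng = random.Random(seed)
    entries = [(real, True)]
    for _ in range(decoys):
        # Decoys have the same length as the real value so they look plausible.
        fake = "".join(rng.choice("0123456789") for _ in real)
        entries.append((fake, False))
    rng.shuffle(entries)

    html_parts, css_rules = [], []
    for value, is_real in entries:
        cls = "c" + secrets.token_hex(4)  # random class name per response
        html_parts.append(f'<span class="{cls}">{value}</span>')
        if not is_real:
            css_rules.append(f".{cls}{{display:none}}")
    return "".join(html_parts), "\n".join(css_rules)
```

Hidden decoys are still present in the markup, so copy and paste and screen readers may pick them up, and a scraper that fetches the stylesheet can filter them out.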

alex
A: 

Don't output the numbers, i.e. prefix

echo $secretNumber;

with //.

phihag
It's a good answer, but probably slightly inconvenient for the non screen-scrapers (i.e. normal users) ;)
alex
Nuke the server from orbit.. It's the only way to be sure ;)
Blorgbeard
A: 

For all those who recommend using JavaScript or CSS to obfuscate the numbers: there's probably a way around it. Firefox has a plugin called abduction. Basically, it saves the page to a file as an image. You could probably modify this plugin to save the image and then analyze it to find the number you are trying to hide.

Basically, if there's enough incentive behind scraping these numbers from the page, then it will be done. Otherwise, just post a regular number and make it easier on your users, so they won't have to worry about not being able to copy and paste the number, or other such problems that result from this trickery.

Kibbee
+6  A: 

It's not possible.

  • You use JavaScript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
  • You use Flash. I can poke into the Flash file and get the values. You encrypt them in the Flash file, and I can just run it and then grab the output from the interpreter's display as a sequence of images.
  • You use images and I can just feed them through an OCR.

You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authoritative source. It's also handy to change your output formats regularly to keep ahead, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.

Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and expand to 1 KB or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assistive computing devices (visually impaired people).

Adam Hawes
A: 

Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.

Joshua
A: 

I don't think this is possible. You can make their job harder (use images, as some suggested here), but that is all you can do. You can't stop a determined person from getting the data. If you don't want them to scrape your data, don't publish it. It's as simple as that.

Waleed Eissa
A: 

Assuming these numbers are updated often (if they aren't, protecting them is moot, since a human can just transcribe them by hand), you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates; if you can limit these checks, you win, without resorting to obfuscation.

For pointers on throttling see this question.
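As a rough illustration of throttling, here is a small in-memory sliding-window limiter in Python. The class name, the per-client key (an IP address or account ID, say), and the injectable `clock` are assumptions for the sketch; a real deployment would likely keep the counters in shared storage rather than in process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay this request
        hits.append(now)
        return True
```

For example, `SlidingWindowLimiter(30, 60)` would let each client fetch the numbers at most 30 times per minute, which a human will rarely notice but which slows a polling script considerably.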

Max Caceres