views:

679

answers:

6

I have been thinking quite a bit here lately about screen scraping and what a task it can be. So I pose the following question.

Would you, as a site developer, expose simple APIs (such as JSON results) to deter users from screen scraping?

These results could then be cached, and they are much smaller in traffic terms than the huge amounts of markup that would otherwise be downloaded.

I am not looking at preventing scraping, but at deterring it.


Scraping Bandwidth Sample

Formula:

((users * (% / 100)) * ((freq * 60) * 24)) * filesize

  • users: 200,000
  • % of users using utility: 5
  • filesize: 1 KB
  • freq: 1 request per minute

Substituting:

10,000 * 1,440 * 1

14,400,000 KB, or about 13.73 GB

Assuming your JSON result is 200 bytes, that's now (10,000 * 1,440 * 0.2), or 2,880,000 KB — about 2.75 GB a day.

That's a saving of about 11 GB of traffic a day.
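The arithmetic above can be sketched in a few lines; the figures (200,000 users, 5% adoption, one request per minute, 1 KB markup vs. a 200-byte JSON result) are taken from the example, not measured data:

```python
# Reproduce the bandwidth estimate from the question.
users = 200_000
pct_using = 0.05            # 5% of users run the utility
requests_per_day = 60 * 24  # one request per minute

def daily_gb(payload_kb):
    """Daily transfer in GB for a given payload size in KB."""
    total_kb = users * pct_using * requests_per_day * payload_kb
    return total_kb / (1024 * 1024)

html_gb = daily_gb(1.0)     # scraping full markup at 1 KB
json_gb = daily_gb(0.2)     # serving a 200-byte JSON result
saving = html_gb - json_gb  # roughly 11 GB per day
```

Scaling the payload scales the totals linearly, so the ~11 GB/day saving holds for any mix of request rates as long as the markup-to-JSON size ratio stays around 5:1.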


For reference, my Stack Overflow profile page is 96 KB.


This question was prompted by a feature request asking for a JSON result for user profiles:
http://stackoverflow.uservoice.com/pages/general/suggestions/101342-add-json-for-user-information

I wanted to find out if other developers would expose this type of API, and if it is worth your time to provide these APIs to reduce bandwidth.

A: 

If you want to provide an open model that people can develop solutions on top of your site, then yes, you should provide an API. Screen scraping is a method of hostile integration, and should only be used as a last resort.

JoshBerke
But wouldn't exposing APIs to those people trying to hostilely integrate help reduce the traffic?
Tom Anderson
I think that is what @Josh is saying: people will use screen scraping only if you don't provide an API.
tvanfosson
"Should only be used as a last resort" - unfortunately is the producer that's asking this question, not the consumer. What he thinks the last resort should be doesn't have much to do with anything.
le dorfier
I don't think you can draw a correlation between screen scraping and traffic without knowing how large the API responses are... but since the API should contain only data, you might be safe to assume that the API will reduce traffic.
JoshBerke
@le show me a consumer who'd rather screen scrape than use an API, and I'll show you someone who cries when I break his screen scraping code because I modify my markup.
JoshBerke
+1  A: 

Screen scraping is not realistically preventable. Providing an API, while nice to those who consume your data, can't prevent it. Since the data ultimately has to be human readable, it therefore is machine readable. You would be better off spending your energy working on your site and not working for those who would consume your data (legally or not).

wget, Perl, and regular expressions are the most common mechanism for scraping data.

Michael MacDonald
But if the same data is available via an API, why would anyone want to screen-scrape, since it's a lot harder to do than calling an API?
Joachim Sauer
Those are my thoughts; then you can format to whatever human-readable format you like. I am not looking at prevention, but at deterring them from doing it: "OK, so you want data from my site for a program? Use this API instead of hogging my bandwidth and giving me false analytics."
Tom Anderson
Sure, if you want to spend the extra effort to make the data more readily available, then absolutely publish an API. If I needed to, I would much prefer to use an API than screen scrape.
Michael MacDonald
+5  A: 

Providing an API should definitely reduce the amount of screen scraping that gets done against your site. Using a good REST API is much easier and safer than screen scraping. Screens can change without notice, and that makes screen scraping code much harder to maintain. As a developer, if I need information from a site, I'd never scrape the site if the same information was available through an API.

Bill the Lizard
+1  A: 

If you want to encourage people to integrate with your site or it is popular enough for this to be a problem (so that you are forced to allow people to integrate with it), then by all means provide an API. If your API is adequate and easy to use, then people will prefer it to screen scraping. If your API is inadequate or harder to use than a screen scraper then you may still have the problem.

tvanfosson
+2  A: 

If it's easier for technical users to use an API than it is for them to screen-scrape, they will do so. Better still, if you can encourage people to use your APIs instead of screen-scraping, you should have a much easier time monitoring traffic, because the automated user agents are clearly distinguished from the browser user agents.

A RESTful JSON interface is a good choice, because it can be scripted from any other language fairly easily (show me a language that doesn't have a JSON parser and I'll show you a language nobody cares about).
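As one illustration of how little is needed, a minimal JSON endpoint can be served from a standard library alone; the `/users/<id>` route, the profile fields, and the stand-in data below are all hypothetical, not part of any real site:

```python
# Minimal sketch of a RESTful JSON profile endpoint (Python stdlib only).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in data; a real site would query its user store here.
PROFILES = {"42": {"name": "Tom", "reputation": 1234}}

class ProfileHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Treat the last path segment as the user id, e.g. /users/42
        user_id = self.path.rstrip("/").rsplit("/", 1)[-1]
        profile = PROFILES.get(user_id)
        if profile is None:
            self.send_error(404)
            return
        body = json.dumps(profile).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8000), ProfileHandler).serve_forever()
```

The serialized profile here is a few dozen bytes, versus tens of kilobytes for a rendered profile page, which is the whole bandwidth argument in miniature.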

Tom
A: 

Most developers will choose their technology for their own reasons. So if you provide an API that is easier than what they use to scrape your screens, then some unknown percentage will move to it. Bandwidth reduction will probably be very low on their lists of considerations.

Since you haven't specified what's being scraped for, we can't help you guess what kind of API to provide, or what proportion will use it.

One very common scraping tool that's hard to deflect is Excel, or some other product that makes the scrape painless.

If your intent is just to minimize the pain (which one might infer from your question), then by far the most useful thing to do is to query the scrapers - more useful than querying SO, anyway.

You might check woot.com and see what they provide on an RSS feed to unburden the web server.

le dorfier
I am not really asking about implementation; I was trying to keep it language-agnostic and discuss the time-vs-bandwidth investment.
Tom Anderson
Your time, or the scraper's time? Your investment, or the scraper's investment? Or both? It's not clear where you think time and bandwidth are being traded off - there are several interpretations possible. And the choice of implementation has everything to do with the time and bandwidth.
le dorfier
Publisher's time vs. publisher's bandwidth: if I spend 30 minutes creating a JSON result of my users' profiles, it will save me upwards of 1-10 KB per request.
Tom Anderson
Why would you spend the 30 minutes more than once? Presumably you'd code it once, then make it available by web service or RSS or whatever. Also, a saving of 1-10 KB doesn't sound significant unless you have some data-volume info you'd like to tell us.
le dorfier
I don't think we are on the same page here. The 30 minutes is a one-time cost; the 1-10 KB is the HTML markup size, whereas the JSON would be around 100 bytes, so the saving is 1-10 KB of bandwidth for every request. 1,000 requests a day saves roughly 1 MB of traffic (at 1 KB) a DAY.
Tom Anderson
To continue: now imagine a socially popular site with 200k users, where 5% of those users use a reputation-tracking system. That is 10k users with an application that pings every minute - 1,440 requests per day per user, or 14,400,000 requests total. Roughly 13 GB of data transfer a day for this app at 1 KB.
Tom Anderson
I've put the formula in the question now to explain it better, given the small comment allowances.
Tom Anderson
Then to answer your questions specifically, yes, other developers provide this kind of API (Amazon and Woot come to mind immediately); and yes, it's worth 30 minutes of your time if even a small proportion of your users make use of it.
le dorfier