541 views · 9 answers

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!

But I'm having a hard time seeing how useful this could be.

Most application data is pretty specific to that application, even on the web. For example, say I scrape all of the questions and answers off of Stack Overflow, or all of the results off of Google (assuming this were possible). I'm left with data that isn't very useful unless I have either a competing question-and-answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).

So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.

+5  A: 

It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.

The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.

It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
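For flavor, here's roughly what that kind of pull looks like in Python with requests and BeautifulSoup. This is a minimal sketch, not the actual client code; the URL and the table markup are invented for illustration.

```python
# Minimal scraping sketch -- hypothetical URL and markup, for illustration only.
import requests
from bs4 import BeautifulSoup

def scrape_flight_rows(url):
    """Fetch a page and pull the cells out of a (hypothetical) results table."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.flights tr"):  # selector is an assumption
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in scrape_flight_rows("https://example.com/flight-status"):
        print(row)
```

The fragile part is always the selector: the moment the site's markup changes, the scraper needs updating.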

harpo
Why would you *need* flight tracking data within your application? Why not just link to the site that already provides this information?
010
@010 Plenty of reasons... For example, the site with flight tracking information might have superfluous data that isn't critical to your application, or it might be displayed in a format that isn't suited for the way you intend to use it. Not to mention it's a hassle to make your users go somewhere else when the data *could* be embedded directly within your application.
Donut
@dgritsko - Good point. Thanks.
010
+2  A: 

If a site has data that would benefit from being accessible through an API (and it would be free and legal to expose it that way), but the site just hasn't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.

Donut
Not so practical since your duplication would be immediately obvious.
010
The point should not be to subvert the original purpose of the content's creator, but rather to extend the way the data is used in order to make it even more useful. harpo's story of comparing online dictionaries is a good example of this point.
Donut
@dgritsko - Yes, that's a good way of saying it. Thanks.
010
+1  A: 

One example from my experience.

I needed a list of major cities throughout the world, with their latitudes and longitudes, for an iPhone app I was building. The app would use that data, along with the iPhone's geolocation feature, to show which major city each user was closest to (so as not to reveal anyone's exact location) and to plot those users on a 3D globe of the earth.

I couldn't easily find an appropriate list in XML/Excel/CSV format anywhere, but I did find a Wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
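Something like this sketch, assuming Python (the original script's language isn't stated); the URL and the column layout are placeholders for whatever the real Wikipedia table used:

```python
# Scrape a (hypothetical) Wikipedia-style table of cities into SQLite.
import sqlite3
import requests
from bs4 import BeautifulSoup

def load_cities(url, db_path="cities.db"):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT, lat REAL, lon REAL)")
    conn.execute("DELETE FROM cities")  # so re-running refreshes the list
    for tr in soup.select("table.wikitable tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:  # assumed layout: name, latitude, longitude
            try:
                conn.execute("INSERT INTO cities VALUES (?, ?, ?)",
                             (cells[0], float(cells[1]), float(cells[2])))
            except ValueError:
                continue  # skip rows whose coordinates aren't plain numbers
    conn.commit()
    conn.close()
```

Because it's a script rather than a one-off text-editor session, re-running it picks up any rows added since.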

Eric Petroelje
Good example. Although I suppose you could also have done this in other ways, such as copying the web page locally and using a text editor and some regular expressions to strip out the data. But perhaps in this case you'd determined that scraping was the quickest or easiest tool for the job?
010
@010 - I probably could have used a text editor and regexes to do it, but the nice thing about writing a screen scraper is that if people go to that page and add more cities to the list (it's obviously pretty incomplete) I can just re-run the scraper to pick up the new ones.
Eric Petroelje
Oh, right. That's a good point. Thanks.
010
@010" - Copying a web page locally and using a text editor and some regular expressions to strip out the data" is pretty much exactly what screen scraping is, except a computer is doing it all at once. If the former is useful, the latter is useful.
Triptych
+1  A: 

Well, to collect data from a mainframe, for one. Mainframes are still in use in the financial world, often running software written in the previous century. The people who wrote that software may already be retired, and since it is critical to these organizations, they really hate having to add new code to it. Screen scraping offers an easy interface to the mainframe: it collects information from the screen and sends it onwards to whatever process needs it.

Rewrite the mainframe application, you say? Software on mainframes can be very old; I've seen mainframe software over 30 years old, written in COBOL. Often those applications work just fine, and companies don't want to risk rewriting parts, because that might break code that has been working for over 30 years. Don't fix things if they're not broken, please. Of course, additional code could be written, but it takes a long time for mainframe code to reach a production environment, and experienced mainframe developers are hard to find.

I myself had to use screen scraping in a software project too. This was a scheduling application that had to capture the console output of every child process it started. That's actually the simplest form of screen scraping, and many people don't even realize that redirecting the output of one application to the input of another is still a kind of screen scraping. :)
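In Python terms (the scheduler itself wasn't necessarily Python), that simplest form is just:

```python
# Capture everything a child process writes to its console.
import subprocess

result = subprocess.run(["echo", "hello from the child"],
                        capture_output=True, text=True)
print(result.returncode)   # the child's exit code
print(result.stdout)       # everything it printed to stdout
```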

Basically, screen scraping lets you connect one (web) application to another. It's often a quick solution, used when other approaches would cost too much time. Everyone hates it, but the amount of time it saves still makes it very effective.

Workshop Alex
+2  A: 

For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.

With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl to screen scrape its way through the login and upload forms, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least, it could break the app), but it works. It's been working now for over a year.
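The same trick in Python with requests looks roughly like this; every URL and form-field name below is hypothetical, standing in for whatever the vendor's HTML forms actually expect:

```python
# Drive a login form and a file-upload form, WWW::Mechanize-style.
import requests

def translate_file(path):
    with requests.Session() as s:  # the session keeps the login cookie
        s.post("https://vendor.example.com/login",
               data={"user": "me", "password": "secret"}, timeout=30)
        with open(path, "rb") as f:
            resp = s.post("https://vendor.example.com/translate",
                          files={"upload": f}, timeout=120)
        out_path = path + ".translated"
        with open(out_path, "wb") as out:
            out.write(resp.content)  # save the returned file
    return out_path
```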

Nick Gotch
I wouldn't even have thought of trying to simulate a form submission to read the output. I'm not familiar with Mechanize, but I take it that going through the form was not as complicated as one might assume.
010
The biggest hurdle was getting over a JavaScript bump, but once that was done it actually went seamlessly.
Nick Gotch
+1  A: 

Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.

For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow had no API at the time, was to screen scrape.
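A tracker like that might have been as little as a cron job and a regex. This sketch is a guess at the shape; the profile URL and the markup pattern are invented:

```python
# Poll a (hypothetical) profile page and log the reputation number to CSV.
import csv
import re
import time
import requests

def record_reputation(profile_url, csv_path="rep_history.csv"):
    html = requests.get(profile_url, timeout=10).text
    m = re.search(r'class="reputation[^>]*>\s*([\d,]+)', html)  # assumed markup
    if m:
        rep = int(m.group(1).replace(",", ""))
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerow([time.strftime("%Y-%m-%d"), rep])
```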

Triptych
+2  A: 

Let's say you wanted to get scores from a popular sports site that did not make the information available via an XML feed or API.

Cody C
+2  A: 

A good example is Stack Overflow - no need to scrape its data, since they've released it under a CC license. The community is already crunching statistics and creating interesting graphs.

There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.

Don't steal and republish; build something even better with the data: new ways of understanding, searching, or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language, understand data, or help promote the semantic web. Remember, it's for fun, not profit!

Hope that helps :)

Al
+1  A: 

The obvious case is when a web service doesn't offer reverse search. You can implement that reverse search over the same data set, but doing so requires scraping the entire data set.

This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
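Concretely, the pre-processing might be an inverted index built once over the scraped records, so the value-to-key direction becomes cheap. The data here is made up:

```python
# Build a word -> keys index over a scraped dataset to answer reverse queries.
from collections import defaultdict

entries = {  # hypothetical scraped data: number -> directory listing
    "xyz0001": "John Smith, home",
    "xyz0002": "Acme Corp, fax line",
    "xyz0003": "Acme Corp, front desk",
}

index = defaultdict(set)
for number, text in entries.items():
    for word in text.lower().replace(",", " ").split():
        index[word].add(number)

print(sorted(index["fax"]))  # -> ['xyz0002']
```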

MSalters
Reverse search - do you mean like a reverse phone number lookup service where you have the phone number but you don't have the name? Seems like an atypical requirement, no?
010
Well, that's a clear case of reverse search, but one in which the reverse search is not technically hard. It's generally a legal/regulatory matter whether that's allowed (from personal experience). A harder example would be, given a range of numbers xyz0000-xyz9999, find the entries which contain the word "fax".
MSalters