views: 476
answers: 4

I was recently asked by a client to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site for one of their providers. They asked if there was an API for this and were told there wasn't one, but that if they could get the data out of the provider's engine they could use it as they wanted to.

My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? Obvious legal/ethical issues aside, since they have already asked for permission to do what we're planning.

As an aside, I would prefer to do any processing in python.

Thanks

+2  A: 

You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at what data the form actually contains.

Also, if the form has method="GET", the request data is just part of the URL given to urlopen.
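
A minimal sketch of both cases (Python 2, matching the urllib API above; the URL and field names are placeholders for whatever the quote form uses):

    import urllib

    form_data = urllib.urlencode({'zipcode': '90210', 'coverage': 'auto'})

    # method="POST": pass the encoded data as the second argument
    html = urllib.urlopen('http://example.com/quote', form_data).read()

    # method="GET": the encoded data is just part of the URL
    html = urllib.urlopen('http://example.com/quote?' + form_data).read()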

The pretty much standard choice for scraping the returned HTML data is BeautifulSoup.
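
Extraction then looks something like this (BeautifulSoup 3.x import; the tag and class names are made up for illustration):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)
    # pull out whichever elements hold the quote values
    for cell in soup.findAll('td', {'class': 'quote-amount'}):
        print cell.string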

balpha
I've heard complaints about BeautifulSoup being slow; is that really the case? Another option might be Scrapy (www.scrapy.org), which is built on top of Twisted, but I haven't used either and can't make a fair comparison.
Meredith L. Patterson
I'm not so much concerned with the speed of the screen-scraping tool I use; the results pages are relatively small. The problem I was concerned about was the mechanics of actually getting the data back in the first place.
Barry
BeautifulSoup is slow - it's pure Python.
Wahnfrieden
urllib2 is often a better choice than urllib, when you need to extend or customize the reading. Especially if there's any authentication required.
S.Lott
You often end up needing parts of urllib even if you use urllib2, unfortunately. It's a messy API.
Wahnfrieden
+3  A: 

A really nice library for screen-scraping is mechanize, which I believe is a clone of an original library written in Perl. Anyway, use that in combination with the ClientForm module, plus some additional help from BeautifulSoup, and you should be away.

I've written loads of screen-scraping code in Python, and these modules turned out to be the most useful. Most of the stuff that mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note, it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
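
To give a flavour of that programmatic-browser style, a minimal sketch (the URL, form name, and field name here are hypothetical):

    import mechanize

    br = mechanize.Browser()
    br.open('http://example.com/quote')

    br.select_form(name='quoteform')  # pick the form by its name attribute
    br['zipcode'] = '90210'           # fields are filled in like dict entries
    response = br.submit()            # submits the form and follows redirects

    html = response.read()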

For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.

Basically, you will definitely be able to do this in Python, and your results should be really good given the range of tools out there.

jkp
A: 

I see the other answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of Javascript, that is. If it IS a Javascript-heavy site, dependent on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or selenium, etc.

You'll have to do substantial work in Javascript, and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses, etc.; that's why the job is so much harder if that is the case. But in any case you might end up with (at least in part) local HTML copies of the site's pages as displayed, and finish by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be Javascript-light!-)
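
For the selenium route, a rough sketch (this uses the classic Selenium WebDriver Python bindings; the URL and element ids are invented, and a local Firefox install is assumed):

    from selenium import webdriver

    driver = webdriver.Firefox()  # launches a real browser
    driver.get('http://example.com/quote')

    driver.find_element_by_id('zipcode').send_keys('90210')
    driver.find_element_by_id('submit').click()

    html = driver.page_source  # the HTML *after* the Javascript has run
    driver.quit()

From there you can feed the HTML into BeautifulSoup or lxml exactly as in the other answers.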

Alex Martelli
A: 

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (BeautifulSoup's claim to fame). It has a compatibility API for BeautifulSoup too, if you don't want to learn the lxml API.
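
A quick sketch of the same sort of extraction done with lxml.html (the selector is made up for illustration):

    import lxml.html

    doc = lxml.html.fromstring(html)
    for cell in doc.cssselect('td.quote-amount'):
        print cell.text_content()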

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or some other environment where anything that isn't pure Python isn't allowed.

Wahnfrieden
I mentioned this in my answer :)
jkp
And I elaborated.
Wahnfrieden