I want to create a REST API for TV listings in my country. While online aggregations of TV listings do exist, they're too tied to the presentation to be of any use to software developers.

To get hold of this information, I'm thinking of going to each source and scraping the relevant data. I've obtained similar information from HTML pages before, but it was an extremely cumbersome process. Do any Python features or libraries exist that would make this easier?

+18  A: 

Beautiful Soup will save you a great deal of pain.
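
To give a feel for it, here is a minimal sketch, assuming the current `bs4` package (Beautiful Soup 4) and a made-up listings table. The markup, the `time`/`title` class names, and the `parse_listings` helper are all hypothetical; a real listings page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a real listings page.
HTML = """
<table id="listings">
  <tr><td class="time">19:00</td><td class="title">Evening News</td></tr>
  <tr><td class="time">20:00</td><td class="title">Nature Hour</td></tr>
</table>
"""

def parse_listings(html):
    """Return (time, title) tuples from the hypothetical table layout above."""
    # "html.parser" is the parser bundled with Python, so nothing else is needed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        time_cell = tr.find("td", class_="time")
        title_cell = tr.find("td", class_="title")
        # Skip rows that do not have both cells (headers, ads, spacers).
        if time_cell is not None and title_cell is not None:
            rows.append((time_cell.get_text(strip=True),
                         title_cell.get_text(strip=True)))
    return rows

listings = parse_listings(HTML)
```

Each source site would probably get its own small function like this, all returning the same tuple shape for the API layer to consume.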

Ali A
This is wonderful! Thanks Ali!
Seconded. BS is the first thing that naturally comes to mind.
ayaz
I've also just recently discovered BeautifulSoup. Up until then I didn't know it was possible to fall in love with a piece of code... :b
efotinis
+8  A: 

Another option is to use lxml.html. I've occasionally found that it handles some pages better than BeautifulSoup (odd HTML comment corner cases), and the API may be more familiar if you've worked with XML. If BeautifulSoup does handle certain pages better, you can still use it while retaining the same interface by using the lxml.html.soupparser module.
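
A rough sketch of the lxml.html approach, using XPath on an invented listings table. The markup, class names, and the `parse_listings` helper are assumptions for illustration only.

```python
import lxml.html

# Invented sample page, standing in for a real listings source.
HTML = """<html><body>
<table id="listings">
  <tr><td class="time">19:00</td><td class="title">Evening News</td></tr>
  <tr><td class="time">20:00</td><td class="title">Nature Hour</td></tr>
</table>
</body></html>"""

def parse_listings(html):
    """Return (time, title) tuples from the hypothetical table layout above."""
    doc = lxml.html.fromstring(html)
    rows = []
    for tr in doc.xpath('.//table[@id="listings"]//tr'):
        # Each query returns a list of text nodes; it is empty if the cell is missing.
        times = tr.xpath('td[@class="time"]/text()')
        titles = tr.xpath('td[@class="title"]/text()')
        if times and titles:
            rows.append((times[0].strip(), titles[0].strip()))
    return rows

listings = parse_listings(HTML)
```

If a page parses badly this way, swapping `lxml.html.fromstring` for `lxml.html.soupparser.fromstring` (which needs BeautifulSoup installed) should leave the rest of the function unchanged.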

Brian
For all the good press BeautifulSoup gets in the Python community, I've found that 4 of the 6 sites I've scraped today make the latest version of BS choke, while lxml.html works perfectly. I could be doing something wrong though, I reckon...
Prairiedogg
I'm finding the biggest problem is CSS/JavaScript insanity: www.ebay.com, for example, makes BeautifulSoup choke horribly (weird quoting, etc.). Slashdot is another such site; all their links start with // instead of http://.
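
Links that start with // are protocol-relative, and the standard library can resolve them against the page they came from. A small sketch, assuming Python 3; the page URL and hrefs below are invented examples.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were scraped from.
PAGE_URL = "http://slashdot.org/index.html"

# Invented hrefs: protocol-relative, site-relative, and already absolute.
hrefs = ["//slashdot.org/story/1", "/faq", "http://example.com/x"]

# urljoin borrows the scheme (and host, when needed) from the page URL.
resolved = [urljoin(PAGE_URL, href) for href in hrefs]
```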
Kurt
A: 

While BeautifulSoup is a good piece of code, depending on what you are trying to extract from the web page, you may not need that much intelligence. The data you're looking for may be easily picked out by a regular expression, for example.
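
As an illustration of the kind of case I mean, here is a sketch with invented markup and a hypothetical pattern. It only holds up while the page keeps exactly this cell layout, so treat it as a quick hack rather than a parser.

```python
import re

# Invented sample markup with a very regular structure.
HTML = """
<tr><td class="time">19:00</td><td class="title">Evening News</td></tr>
<tr><td class="time">20:00</td><td class="title">Nature Hour</td></tr>
"""

# Capture the time and the title from each adjacent pair of cells.
ROW_RE = re.compile(
    r'<td class="time">(\d{1,2}:\d{2})</td>\s*'
    r'<td class="title">([^<]+)</td>'
)

# findall returns one (time, title) tuple per match.
listings = ROW_RE.findall(HTML)
```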

Ned Batchelder
You're succumbing to the temptations of the dark god Cthulhu! http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Acorn
A: 

This doesn't answer your question directly, but you may want to think about getting your data from another service, such as Schedules Direct. They provide XML, and it's the recommended data provider for xmltv.
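
Once you have the XML, reading it takes only the standard library. The sketch below assumes an XMLTV-style layout (`programme` elements with `start`, `stop`, and `channel` attributes and a `title` child); the sample data is invented, the `parse_programmes` helper is hypothetical, and fetching from the provider is not shown.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Invented sample in the XMLTV style; real feeds carry many more fields.
XMLTV = """<tv>
  <channel id="ch1.example"><display-name>Channel One</display-name></channel>
  <programme start="20240101190000 +0000" stop="20240101200000 +0000" channel="ch1.example">
    <title lang="en">Evening News</title>
  </programme>
</tv>"""

def parse_programmes(xml_text):
    """Return one dict per programme element with channel, start, and title."""
    root = ET.fromstring(xml_text)
    programmes = []
    for prog in root.iter("programme"):
        # Timestamps are assumed to look like "YYYYMMDDhhmmss +ZZZZ".
        start = datetime.strptime(prog.get("start"), "%Y%m%d%H%M%S %z")
        programmes.append({
            "channel": prog.get("channel"),
            "start": start,
            "title": prog.findtext("title"),
        })
    return programmes

programmes = parse_programmes(XMLTV)
```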

Jeremy Cantrell
+1  A: 

Use mechanize to automate browsing, and BeautifulSoup to parse the HTML. (I do lots of stuff like what you described.)

RexE