I'd like to grab daily sunrise/sunset times from here. Is it possible to scrape web content with Python? What modules are used? Is there a tutorial available?

Thanks

+2  A: 

You can use urllib2 to make the HTTP request, and then you'll have the raw web content to parse.

You can get it like this:

import urllib2

# Fetch the page and read its raw HTML into a string
response = urllib2.urlopen('http://www.timeanddate.com/worldclock/astronomy.html?n=78')
html = response.read()

Beautiful Soup is a Python HTML parser that is supposed to be good for screen scraping.

In particular, here is their tutorial on parsing an HTML document.
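
As a minimal sketch of how the two fit together (the 'spad' table class comes from the answer below and is an assumption about the page's current markup):

from BeautifulSoup import BeautifulSoup

# Parse the HTML fetched above and look up the astronomy table.
# The 'spad' class is an assumption and may change with the site's markup.
soup = BeautifulSoup(html)
table = soup.find('table', {'class': 'spad'})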

Good luck!

danben
+9  A: 

Just use urllib2 in combination with the brilliant BeautifulSoup library:


import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and hand the HTML straight to BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.timeanddate.com/worldclock/astronomy.html?n=78').read())

# Walk the rows of the astronomy table and print date and sunrise
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
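
If you need the cells as real time objects rather than strings, datetime.strptime can convert them; the '%I:%M %p' format below is only a guess at how the site renders its times:

import datetime

# Hypothetical: turn a scraped cell such as '7:02 AM' into a time object.
# The sample string and format are assumptions about the site's output.
sunrise = datetime.datetime.strptime('7:02 AM', '%I:%M %p').time()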
flurin
+1  A: 

Python has several options for web scraping. I enumerated some of them here, in response to a similar question.

filippo
+5  A: 

To help you choose among the various modules available, I think this talk (from PyCon 2009) is what you're looking for; the speaker has a lot of experience with the subject, and he points out the good and the bad of most of the available modules.

From the PyCon 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update:

Asheesh Laroia has updated his presentation for PyCon 2010:

  • PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it

    * My motto: "The website is the API."
    * Choosing a parser: BeautifulSoup, lxml, HTMLParser, and html5lib.
    * Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
    * Automatic template reverse-engineering tools.
    * Submitting to forms.
    * Playing with XML-RPC
    * DO NOT BECOME AN EVIL COMMENT SPAMMER.
    * Countermeasures, and circumventing them (one workaround is sketched after this list):
          o IP address limits
          o Hidden form fields
          o User-agent detection
          o JavaScript
          o CAPTCHAs 
    * Plenty of full source code to working examples:
          o Submitting to forms for text-to-speech.
          o Downloading music from web stores.
          o Automating Firefox with Selenium RC to navigate a pure-JavaScript service. 
    * Q&A and workshopping
    * Use your power for good, not evil. 
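
As an illustration of one of the countermeasures above, user-agent detection can often be worked around by sending a browser-like User-Agent header with urllib2; a minimal sketch (the URL and header value are placeholders, not from the talk):

import urllib2

# Some sites block urllib2's default user agent, so pretend to be a browser.
# The URL and User-Agent string here are placeholders.
request = urllib2.Request('http://www.example.com/',
                          headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()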
    
Diego Castro
A: 

All this is cool, but how do we interact with the target web page so that we can get to the actual place we want to extract the data from? Many web pages encode session IDs in their dynamically generated URLs, for example. Also, many pages can detect whether the visitor is a bot or a real person (I don't know how), so what do we do?

ropz