I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and I'm confused by the array of options; everything I find is suited to Python 2.x. I've read raves about Beautiful Soup, html5lib and lxml, but I cannot figure out how to install any of these on Windows.

Questions:

  1. What HTML parser do you recommend?
  2. How do I install it? (Be gentle, I'm completely new to Python and remember I'm on Windows)
  3. Do you have a simple example of how to use the recommended library to grab the HTML from a specific URL and return a value out of, say, something like this:

    fooLink

(say we want to return "/blahblah")

A: 

Beautiful Soup, as of version 3.1.0.1 (January 2009), also works with Python 3.x.

I do not have direct experience with Beautiful Soup under Py3k (although this should soon change...). I just read, however, that version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than its previous versions, so I may try to wait if possible (i.e. stay with Python 2.6 a bit longer).

mjv
+1  A: 

If your HTML is well formed, you have many options, such as SAX and DOM. If it is not well formed, you need a fault-tolerant parser such as Beautiful Soup, element tidy, or lxml's HTML parser. No parser is perfect; when presented with a variety of broken HTML, I sometimes have to try more than one. lxml and ElementTree use a mostly compatible API that is more of a standard than Beautiful Soup's.

In my opinion, lxml is the best module for working with XML documents, but the ElementTree included with Python is still pretty good. In the past I have used Beautiful Soup to convert HTML to XML and construct an ElementTree for processing the data.
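
For the well-formed case, a minimal sketch using the ElementTree that ships with Python might look like the following. The sample markup (an `<a>` tag whose `href` is "/blahblah" and whose text is "fooLink") and the helper name `find_href` are assumptions made for illustration; the question does not show the real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; the real page markup is an assumption.
SAMPLE = '<div><p>intro</p><a href="/blahblah">fooLink</a></div>'


def find_href(markup, link_text):
    """Return the href of the first <a> whose text equals link_text, else None."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on markup
    # that is not well formed, so this only suits clean input.
    root = ET.fromstring(markup)
    for anchor in root.iter("a"):
        if (anchor.text or "").strip() == link_text:
            return anchor.get("href")
    return None


print(find_href(SAMPLE, "fooLink"))
```

Real-world pages are rarely well formed, so this approach will likely fail with a parse error on most sites; that is where the fault-tolerant parsers mentioned above come in.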

mikerobi
+2  A: 

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is oft recommended (every question regarding web scraping with Python on Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it, as the installation code was still Python 2.

As for an adequate and simple-to-install solution for Python 3, you can try the standard library's HTML parser (the `html.parser` module). It is quite barebones, but it comes with Python 3, so there is nothing to install.
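
As a rough sketch of what that could look like, the snippet below subclasses `HTMLParser` from `html.parser` to collect the `href` of any link whose text matches. The class name `LinkFinder`, the helpers `find_href` and `fetch`, and the sample markup are all hypothetical, since the question's original HTML is not shown. It is written for a current Python 3 release rather than tested on 3.1.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkFinder(HTMLParser):
    """Collect the href of every <a> tag whose text equals link_text."""

    def __init__(self, link_text):
        super().__init__()
        self.link_text = link_text
        self.matches = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.link_text:
                self.matches.append(self._href)
            self._href = None


def find_href(html, link_text):
    """Return the first matching href in html, or None if there is none."""
    parser = LinkFinder(link_text)
    parser.feed(html)
    parser.close()
    return parser.matches[0] if parser.matches else None


def fetch(url):
    """Download a page and decode it, assuming UTF-8 if no charset is given."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Assumed sample markup; for a live page use find_href(fetch(url), "fooLink").
print(find_href('<p>hi</p><a href="/blahblah">fooLink</a>', "fooLink"))
```

The parser is event based, so it never builds a tree; the class simply remembers whether it is inside an `<a>` tag and compares the collected text when the tag closes.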

Beau Martínez