From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, mostly because I found its syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml, and I've heard that lxml is faster.

So I'm wondering: what are the advantages of one over the other? When would I want to use lxml, and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?

+11  A: 

For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.

Alex Brasetvik
+1 Didn't know about the decay of BeautifulSoup, which I rely upon and adore.
Jonathan Feinberg
Well, lxml says it has good performance, while someone here said BeautifulSoup had really slow performance. It also seems to have a decent API. http://codespeak.net/lxml/performance.html
JohnnySoftware
+2  A: 

I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.

Here's a quick test I had lying around to try handling of some ugly HTML:

import unittest
from io import StringIO
from lxml import etree

class TestLxmlStuff(unittest.TestCase):
    bad_html = """
        <html>
            <head><title>Test!</title></head>
            <body>
                <h1>Here's a heading
                <p>Here's some text
                <p>And some more text
                <b>Bold!</b></i>
                <table>
                   <tr>row
                   <tr><td>test1
                   <td>test2
                   </tr>
                   <tr>
                   <td colspan=2>spanning two
                </table>
            </body>
        </html>"""

    def test_soup(self):
        """Test lxml's parsing of really bad HTML"""
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(self.bad_html), parser)
        self.assertEqual(len(tree.xpath('//tr')), 3)
        self.assertEqual(len(tree.xpath('//td')), 3)
        self.assertEqual(len(tree.xpath('//i')), 0)
        #print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

if __name__ == '__main__':
    unittest.main()
overthink
+2  A: 

Don't use BeautifulSoup directly; use lxml.soupparser. That way you're sitting on top of the power of lxml and can still use the good bit of BeautifulSoup, which is dealing with really broken and crappy HTML.
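To illustrate, here's a minimal sketch of that approach (it assumes both lxml and BeautifulSoup are installed, since lxml.html.soupparser uses BeautifulSoup under the hood; the sample markup is made up):

```python
from lxml.html import soupparser

# Deliberately broken HTML: unclosed <p> and <b> tags.
broken = "<p>Unclosed paragraph<b>bold text"

# soupparser hands the markup to BeautifulSoup for repair,
# then gives you back a normal lxml element tree.
root = soupparser.fromstring(broken)

# So all the usual lxml tools, like XPath, work on the result.
print(root.xpath("//p/text()"))
print(len(root.xpath("//b")))
```

You get BeautifulSoup's tolerance for bad markup with lxml's query power on top.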

Peter Bengtsson
+5  A: 

pyquery provides the jquery selector interface to Python (using lxml under the hood).

http://pypi.python.org/pypi/pyquery

It's really awesome; I don't use anything else anymore.

mikeal