From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, mostly because I found its syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml, and I've heard that lxml is faster.

So I'm wondering: what are the advantages of one over the other? When would I want to use lxml, and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?

+11  A: 

For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.

Alex Brasetvik
+1 Didn't know about the decay of BeautifulSoup, which I rely upon and adore.
Jonathan Feinberg
Well, lxml says it has good performance, while someone here said BeautifulSoup had really slow performance. It also seems to have a decent API. http://codespeak.net/lxml/performance.html
JohnnySoftware
+2  A: 

I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.

Here's a quick test I had lying around to try handling of some ugly HTML:

import unittest
from io import StringIO
from lxml import etree

class TestLxmlStuff(unittest.TestCase):
    bad_html = """
        <html>
            <head><title>Test!</title></head>
            <body>
                <h1>Here's a heading
                <p>Here's some text
                <p>And some more text
                <b>Bold!</b></i>
                <table>
                   <tr>row
                   <tr><td>test1
                   <td>test2
                   </tr>
                   <tr>
                   <td colspan=2>spanning two
                </table>
            </body>
        </html>"""

    def test_soup(self):
        """Test lxml's parsing of really bad HTML"""
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(self.bad_html), parser)
        self.assertEqual(len(tree.xpath('//tr')), 3)
        self.assertEqual(len(tree.xpath('//td')), 3)
        self.assertEqual(len(tree.xpath('//i')), 0)
        #print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

if __name__ == '__main__':
    unittest.main()
overthink
+2  A: 

Don't use BeautifulSoup directly; use lxml.soupparser. That way you're sitting on top of the power of lxml and can still use the good bit of BeautifulSoup, which is dealing with really broken and crappy HTML.
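To illustrate, here's a minimal sketch of that approach (it assumes both lxml and BeautifulSoup are installed, since lxml.html.soupparser uses BeautifulSoup under the hood; the sample markup is made up):

```python
from lxml.html import soupparser

# Deliberately broken HTML: unclosed <p> and <b> tags.
broken = "<p>Unclosed paragraph<b>bold text"

# soupparser hands the markup to BeautifulSoup for repair,
# then gives you back a normal lxml element tree.
root = soupparser.fromstring(broken)

# So all the usual lxml tools, like XPath, work on the result.
print(root.xpath("//p/text()"))
print(len(root.xpath("//b")))
```

You get BeautifulSoup's tolerance for bad markup with lxml's query power on top.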

Peter Bengtsson
+5  A: 

pyquery provides the jquery selector interface to Python (using lxml under the hood).

http://pypi.python.org/pypi/pyquery

It's really awesome; I don't use anything else anymore.

mikeal