ansaurus

Question

Need python lxml syntax help for parsing html

Answer 1

+6 A:

Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.

zweiterlinde 2009-03-02 17:51:14

+1: lxml is for xml. Beautiful Soup is for HTML.

S.Lott 2009-03-02 18:04:42

I started with Beautiful Soup, but I had no luck. I mentioned in my question that my doc is fairly well-formed, but it is missing the ending body block. It simply drops all the content when I pull it into the parser. Hence lxml. Also, http://tinyurl.com/37u9gu indicated better mem mgmt with lxml

Shaheeb Roshan 2009-03-02 20:22:50

I used BeautifulSoup at first, but it doesn't handle bad HTML as well as it claims. It also doesn't support items with multiple classes, etc. lxml.html is better for everything I've done with it.

endolith 2010-01-26 04:59:57

Answer 2

+7 A:

Okay, first, in regards to parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott at least use the version of beautifulsoup included with lxml. That way you will also reap the benefit of a nice xpath or css selector interface.

However, I personally prefer Ian Bicking's HTML parser included in lxml.

Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.

Those two functions are fairly easy to use but they are very limited XPath. I recommend trying to use either the full lxml xpath() method or, if you are already familiar with CSS, using the csselect() method.

Here are some examples, with an HTML string parsed like this:

from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)

Using the css selector class your program would roughly look something like this:

# Find all 'a' elements inside 'tr' table rows with css selector
for a in mySearchTree.cssselect('tr a'):
    print 'found "%s" link to href "%s"' % (a.text, a.get('href'))

The equivalent using xpath method would be:

# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print 'found "%s" link to href "%s"' % (a.text, a.get('href'))

Van Gale 2009-03-02 19:27:20

Yay! Just what I needed. I interpreted the cssselect to actually require the elements to have a declared css class. The nested finding logic is just what I needed! Thank you Van Gale!

Shaheeb Roshan 2009-03-02 20:48:21

This page recommends to use iterchildren and iterdescendants with the tag option. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/#N10239

endolith 2010-01-26 04:58:20

Answer 3

A:

cssselect works beautifully for me with xhtml.

2009-05-27 18:53:49

ansaurus

tags:

views:

answers:

Need python lxml syntax help for parsing html

related questions