Today I tried lxml because I got very nasty HTML output from a particular web service, and I didn't want to use the re module, both for a change and to learn something new. And I did, browsing http://codespeak.net/lxml/ and http://stackoverflow.com in parallel.
I won't try to explain the HTML template in detail; for an overview, it's full of deliberately nested tables.
I extracted the part of interest with the HTML parser and find_class(), and I'm now iterating over the TRs with XPath (and even these TRs have tables inside). I'm trying to extract data pairs based on class and id attributes:
- name child has id "title"
- value child has class "text"
The code looks something like this:
fragment = root.find_class('foo')
for node in fragment[0].xpath('table[2]/tr'):
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')
The problem is that not every TR I iterate over has such a pair: some have only a name (id "title"), so later, when I zip the two lists, I get wrongly paired data.
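A minimal, self-contained reproduction may make the symptom easier to see. The markup in SAMPLE is invented to stand in for the real page (three rows, the middle one with a name but no value), and the `counts` and `pairs` variables exist only for this illustration:

```python
from lxml import html

# Invented markup: the second table under .foo has three rows,
# and the middle row carries a name but no value.
SAMPLE = """
<html><body>
<div class="foo">
  <table><tr><td>layout</td></tr></table>
  <table>
    <tr><td><div id="title">Name A</div></td><td class="text">Value A</td></tr>
    <tr><td><div id="title">Heading only</div></td></tr>
    <tr><td><div id="title">Name B</div></td><td class="text">Value B</td></tr>
  </table>
</div>
</body></html>
"""

root = html.fromstring(SAMPLE)
fragment = root.find_class('foo')

counts = []
for node in fragment[0].xpath('table[2]/tr'):
    # An XPath that starts with '//' is evaluated from the document root,
    # so each row sees every match in the page, not only its own children.
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')
    counts.append((len(name), len(value)))

# zip() pairs purely by position and stops at the shorter list.
pairs = [(n.text_content(), v.text_content()) for n, v in zip(name, value)]
print(counts)
print(pairs)
```

On this sample every row reports 3 names and 2 values, and "Heading only" ends up paired with "Value B", which is the kind of wrong pairing described above.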
I tried a couple of things that came to mind, but none of them worked: comparing the list lengths (for name and value) and skipping the lookup if they don't match, or deleting the last list item if they don't match (in several ways). For example:
if len(name) != len(value):
    name.pop()
or
if len(name) == len(value):
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')
Any comments from someone more experienced?
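One direction that might be worth trying is to make the lookups relative to each row (`.//` instead of `//`) and to build the pair inside the loop, so a row without a value is simply skipped. This is only a sketch: the SAMPLE markup is invented, and `extract_pairs` is a hypothetical helper name, not something lxml provides.

```python
from lxml import html

# Invented markup standing in for the real page.
SAMPLE = """
<html><body>
<div class="foo">
  <table><tr><td>layout</td></tr></table>
  <table>
    <tr><td><div id="title">Name A</div></td><td class="text">Value A</td></tr>
    <tr><td><div id="title">Heading only</div></td></tr>
    <tr><td><div id="title">Name B</div></td><td class="text">Value B</td></tr>
  </table>
</div>
</body></html>
"""


def extract_pairs(container):
    """Return (name, value) tuples for rows that contain both parts."""
    pairs = []
    for row in container.xpath('table[2]/tr'):
        # './/' restricts the search to descendants of the current row.
        names = row.xpath('.//div[@id="title"]')
        values = row.xpath('.//td[@class="text"]')
        if names and values:  # skip rows that have a name but no value
            pairs.append((names[0].text_content().strip(),
                          values[0].text_content().strip()))
    return pairs


root = html.fromstring(SAMPLE)
pairs = extract_pairs(root.find_class('foo')[0])
print(pairs)
```

Pairing row by row avoids the later zip entirely. Because `.//` also descends into tables nested inside a row, this sketch takes only the first match of each kind per row, which may or may not suit the real template.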