Today I tried lxml because I got very nasty HTML output from a particular web service, and I didn't want to use the re module, both for a change and to learn something new. And I did learn, browsing http://codespeak.net/lxml/ and http://stackoverflow.com in parallel.

I won't try to explain the HTML template, but as an overview, it is full of deliberately nested tables.

I extracted the part of interest with the HTML parser and find_class(), and I iterate through the TR elements with XPath (even these TRs have tables inside). Now I'm trying to extract data pairs based on class and id attributes:

  • name child has id "title"
  • value child has class "text"

Code looks something like this:

fragment = root.find_class('foo')

for node in fragment[0].xpath('table[2]/tr'):
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')

The problem is that not every TR I iterate over has both parts: some have only the name (id "title"), so when I later try to zip them I get wrongly paired data.

I tried a couple of things that came to mind, but none worked: I compared the lengths of the name and value lists and skipped the name lookup when they didn't match, and I tried deleting the last list item when they didn't match (in several ways). For example:

if not len(name) == len(value):
    name.pop()

or

if len(name) == len(value):
    name = node.xpath('//div[@id="title"]')

value = node.xpath('//td[@class="text"]')

Any comments from someone more experienced?

A: 

How's this?

from lxml import etree
doc = etree.HTML(open('test.data').read())

for t in doc.xpath('//table[.//div[@id="title"] and .//td[@class="text"]]'):
    print etree.tostring(t.xpath('.//div[@id="title"]')[0])
    print etree.tostring(t.xpath('.//td[@class="text"]')[0])
    print "--"

Yielding:

<div id="title">
              <span class="Browse">string</span>
            </div>

<td class="text" style="padding-left:5px;">
            <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Gospodar of Lutaka
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            1986
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Sep 1985-Dec 1985
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Elektra
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            54:51
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
          </td>

--
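
A note on the `.//` prefix used in the inner lookups above: in lxml, a path that starts with `//` is evaluated from the document root even when `xpath()` is called on a sub-element, while `.//` searches only below the element it is called on. The following is a minimal sketch of that difference in Python 3, assuming a small made-up sample (the markup and the `counts` name are illustrative, not the real page):

```python
from lxml import etree

# Hypothetical sample: only the first row carries a value cell.
root = etree.fromstring(
    '<table>'
    '<tr><td><div id="title">Year</div></td><td class="text">1986</td></tr>'
    '<tr><td><div id="title">Header only</div></td></tr>'
    '</table>'
)

counts = []
for row in root.xpath('tr'):
    absolute = row.xpath('//td[@class="text"]')   # searched from the document root
    relative = row.xpath('.//td[@class="text"]')  # searched inside this row only
    counts.append((len(absolute), len(relative)))

print(counts)  # [(1, 1), (1, 0)]
```

The second row has no value cell, yet the `//` form still finds the one from the first row, which is the kind of mismatch that leads to wrongly paired lists.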

Update: I extended the leading portion of the XPath expression to eliminate an undesired result. Thanks to Alejandro for pointing this out and suggesting a fix, even though it didn't seem to work out for otrov.

from urllib2 import urlopen
from lxml import etree
doc = etree.HTML(urlopen('http://pastebin.com/download.php?i=cg5HHJ6x').read())

for t in doc.xpath('//table/tr/td/table[.//div[@id="title"] and .//td[@class="text"]]'):
    print etree.tostring(t.xpath('.//div[@id="title"]')[0])
    print etree.tostring(t.xpath('.//td[@class="text"]')[0])
    print "--"
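
For readers who want the pairs as data rather than printed markup, here is a hedged Python 3 sketch of per-row pairing that skips incomplete rows. It is not the code from this answer: `SAMPLE` and `extract_pairs` are hypothetical names, the sample is a simplified, well-formed stand-in for the real page, and it uses `find()` (which behaves the same in lxml and in the standard library's ElementTree) instead of `xpath()`.

```python
try:
    from lxml import etree  # preferred; same find() API
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

# Hypothetical, well-formed sample: the second row has a title but no value.
SAMPLE = """
<table>
  <tr><td><div id="title"><span>Artist</span></div></td>
      <td class="text">Elektra</td></tr>
  <tr><td><div id="title"><span>Section header</span></div></td></tr>
  <tr><td><div id="title"><span>Year</span></div></td>
      <td class="text">1986</td></tr>
</table>
"""

def extract_pairs(table):
    """Return (name, value) tuples for rows that contain both cells."""
    pairs = []
    for row in table.findall("tr"):
        # The leading "." keeps each search inside the current row.
        name = row.find('.//div[@id="title"]')
        value = row.find('.//td[@class="text"]')
        if name is None or value is None:
            continue  # incomplete row: skip it instead of mis-pairing later
        pairs.append(("".join(name.itertext()).strip(),
                      "".join(value.itertext()).strip()))
    return pairs

pairs = extract_pairs(etree.fromstring(SAMPLE))
print(pairs)  # [('Artist', 'Elektra'), ('Year', '1986')]
```

Because name and value are looked up together per row, there is no need to zip two separately collected lists afterwards.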
MattH
Excellent! Thanks for the correct answer and nice lesson. Now I can continue with the rest of the code :)
otrov
@otrov: You're welcome! I personally have found XPath to have a steep learning curve. The examples in questions on this site are pretty handy, and there are some XSLT/XPath gurus lurking on SO. Thank you for bumping me over 2000! :)
MattH
Eh, that's nice, it seems you have got yourself a ticket to the near future :) Good luck
otrov
@MattH: Check the answer, I think there is something wrong with it: there is no `td[@class="text"]` for `div[@id="title"][span/@class="Browse"]`
Alejandro
@Alejandro, sorry I'm being a little dense today. Are you saying that this solution is not returning some data that should be returned?
MattH
@MattH: No. I'm saying that it's returning some data (first pair) that should not be returned.
Alejandro
@Alejandro: You noticed that correctly, there are redundant data pairs! For example, the unmatched "name" from the second TR element is paired with the value from the third TR element. BUT the valid pairs are matched correctly, and that was my main problem; in later code I check these pairs against known "name" data, which works fine.
otrov
A: 

Now, with the input sample, it is clearer what you are asking.

This single XPath 1.0 expression returns a node set with the div and td pairs (in document order):

/table/tr/td/table[tr/td/div[@id='title']]
                  [tr/td[@class='text']]
                  /tr//*[self::div[@id='title'] or self::td[@class='text']]

As proof, this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <result>
            <xsl:copy-of 
                 select="/table/tr/td/table[tr/td/div[@id='title']]
                                           [tr/td[@class='text']]
                                           /tr//*[self::div[@id='title'] or
                                                  self::td[@class='text']]"/>
        </result>
    </xsl:template>
</xsl:stylesheet>

Output (with a corrected input sample, because yours is missing a closing td):

<result>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
        <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Gospodar of Lutaka
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            1986
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Sep 1985-Dec 1985
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Elektra
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            54:51
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;"></td>
</result>
Alejandro
About the missing </td>: I quickly checked, and there is indeed an odd number of closing </td> tags and an even number of opening <td> tags. That is exactly what the web service produces. I don't want to expose it here, but I can send you a link by e-mail or similar if you want to check what I wrote.
otrov
In the end I believe your XPath expression works fine, as you checked it with XSLT, but it doesn't when I put it in my code. As an example, I took MattH's snippet and, instead of the for block, I put `node = doc.xpath('/table/tr/td/table[tr/td/div[@id="title"]][tr/td[@class="text"]]/tr//*[self::div[@id="title"] or self::td[@class="text"]]')`, which does not produce a result, similar to Dimitrie's deleted answer. So I probably should have done this with the familiar regex module instead of starting to learn lxml on such an uncomfortable example.
otrov
@otrov, I've updated my solution with a more specific expression inspired by Alejandro, I hope it serves you better. Honestly any time spent using xpath instead of regexps for processing HTML or XML is time well spent!
MattH
@Matt: your initial code was good for my usage, but I now use the slightly modified version, which serves my code basically the same :) Thanks for the encouragement on XPath. I tried to make sense of XML/XSLT in the past but failed; I guess I need to put more brain cells into this subject. @Alejandro: Thanks for the answer, I'll look more into translating a general XPath expression to an lxml XPath expression :)
otrov
@otrov: if the `table` element is not your root element (as it is in your posted input sample), then you could add the missing path to the `table` element (such as `/html/body/`, etc.). I do not recommend beginning a path with the `//` operator because it navigates the whole tree.
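
To illustrate the point about the missing path, here is a small Python 3 sketch with lxml. It assumes a made-up one-row fragment (`FRAGMENT` and the variable names are illustrative): when a fragment whose outermost element is `table` is parsed with `etree.HTML()`, the parser wraps it in `html` and `body`, so an absolute path starting at `/table` matches nothing.

```python
from lxml import etree

# Hypothetical fragment whose outermost element is <table>.
FRAGMENT = (
    '<table><tr>'
    '<td><div id="title">Label</div></td>'
    '<td class="text">1986</td>'
    '</tr></table>'
)

# etree.HTML() returns the <html> element and wraps the fragment in <html><body>.
root = etree.HTML(FRAGMENT)

from_table = root.xpath('/table/tr/td[@class="text"]')            # root is <html>, no match
from_html = root.xpath('/html/body/table/tr/td[@class="text"]')   # full path from the real root
anywhere = root.xpath('//td[@class="text"]')                      # matches, but scans the whole tree

print(root.tag, len(from_table), len(from_html), len(anywhere))   # html 0 1 1
```

On the real page the prefix will differ (for example, an index such as `table[2]` may be needed), so the path has to be checked against the actual document.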
Alejandro
Yeah :) that was embarrassing, I needed to put /table[2]/ instead of /table/. Cheers
otrov