ansaurus

Question

Manipulating list from lxml xpath queries

Answer 1

A:

How's this?

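from lxml import etree

# Parse the saved page with lxml's forgiving HTML parser.
with open('test.data', 'rb') as f:
    doc = etree.HTML(f.read())

# Select every table that contains both a title div and a text cell.
for t in doc.xpath('//table[.//div[@id="title"] and .//td[@class="text"]]'):
    print(etree.tostring(t.xpath('.//div[@id="title"]')[0], encoding='unicode'))
    print(etree.tostring(t.xpath('.//td[@class="text"]')[0], encoding='unicode'))
    print("--")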

Yielding:

<div id="title">
              <span class="Browse">string</span>
            </div>

<td class="text" style="padding-left:5px;">
            <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Gospodar of Lutaka
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            1986
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Sep 1985-Dec 1985
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            Elektra
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
            54:51
          </td>

--
<div id="title">
              <span>string</span>
            </div>

<td class="text" style="padding-left:5px;">
          </td>

--

Update: I extended the leading portion of the XPath expression to eliminate an undesired result. Thanks to Alejandro for pointing this out and for suggesting a fix, even though that fix didn't seem to work out for otrov.

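from urllib.request import urlopen
from lxml import etree

# Fetch the sample page and parse it with lxml's HTML parser.
doc = etree.HTML(urlopen('http://pastebin.com/download.php?i=cg5HHJ6x').read())

# The longer leading path only matches tables nested inside a table cell.
for t in doc.xpath('//table/tr/td/table[.//div[@id="title"] and .//td[@class="text"]]'):
    print(etree.tostring(t.xpath('.//div[@id="title"]')[0], encoding='unicode'))
    print(etree.tostring(t.xpath('.//td[@class="text"]')[0], encoding='unicode'))
    print("--")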
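
To see why the longer leading path matters, here is a minimal sketch on a made-up fragment (the `SAMPLE` markup, the `pairs` helper and the "Browse"/"Artist"/"Year" labels are assumptions for illustration, not the real page). With `//table[...]` the outer layout table also satisfies the predicate, because its descendants include a title div and a text cell, so it contributes a mismatched first pair. Anchoring on `table/tr/td/table` leaves only the nested tables.

```python
from lxml import etree

# Hypothetical, simplified markup: one outer layout table, three nested tables.
# The first nested table has a title but no td[@class="text"].
SAMPLE = (
    '<table>'
    '<tr><td><table>'
    '<tr><td><div id="title"><span class="Browse">Browse</span></div></td></tr>'
    '</table></td></tr>'
    '<tr><td><table>'
    '<tr><td><div id="title"><span>Artist</span></div></td></tr>'
    '<tr><td class="text">Gospodar of Lutaka</td></tr>'
    '</table></td></tr>'
    '<tr><td><table>'
    '<tr><td><div id="title"><span>Year</span></div></td></tr>'
    '<tr><td class="text">1986</td></tr>'
    '</table></td></tr>'
    '</table>'
)

doc = etree.HTML(SAMPLE)
predicate = '[.//div[@id="title"] and .//td[@class="text"]]'

broad = doc.xpath('//table' + predicate)               # any table, at any depth
narrow = doc.xpath('//table/tr/td/table' + predicate)  # only nested tables


def pairs(tables):
    """Return (title, text) strings for the first div/td found in each table."""
    result = []
    for t in tables:
        title = t.xpath('string(.//div[@id="title"])').strip()
        text = t.xpath('string(.//td[@class="text"])').strip()
        result.append((title, text))
    return result


print(len(broad), pairs(broad))
print(len(narrow), pairs(narrow))
```

In this sketch the broad query returns one table more than the narrow one, and its first pair joins the "Browse" title with a value that belongs to a different nested table.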

MattH 2010-08-12 14:23:52

Excellent! Thanks for the correct answer and nice lesson. Now I can continue with the rest of the code :)

otrov 2010-08-12 14:37:29

@otrov: You're welcome! Personally, I have found XPath to have a steep learning curve. The examples in questions on this site are pretty handy, and there are some XSLT/XPath gurus lurking on SO. Thank you for bumping me over 2000! :)

MattH 2010-08-12 14:43:57

Eh, that's nice, it seems that you have given yourself a ticket to the near future :) Good luck

otrov 2010-08-12 14:52:38

@MattH: Check the answer, I think something is wrong with it: there is no `td[@class="text"]` for `div[@id="title"][span/@class="Browse"]`

Alejandro 2010-08-12 16:06:28

@Alejandro, sorry I'm being a little dense today. Are you saying that this solution is not returning some data that should be returned?

MattH 2010-08-12 17:11:18

@MattH: No. I'm saying that it's returning some data (first pair) that should not be returned.

Alejandro 2010-08-12 18:11:07

@Alejandro: You noticed that right, there are redundant data pairs! For example, an unmatched "name" from the second TR element is paired with the value from the third TR element. BUT valid pairs are matched correctly, and that was my main problem, since in later code I check these pairs against known "name" data, which works fine.

otrov 2010-08-13 01:53:02

Answer 2

A:

Now, with the input sample, it is clearer what you are asking.

This single XPath 1.0 expression returns a node set with the div and td pairs (in document order):

/table/tr/td/table[tr/td/div[@id='title']]
                  [tr/td[@class='text']]
                  /tr//*[self::div[@id='title'] or self::td[@class='text']]

As proof, this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <result>
            <xsl:copy-of 
                 select="/table/tr/td/table[tr/td/div[@id='title']]
                                           [tr/td[@class='text']]
                                           /tr//*[self::div[@id='title'] or
                                                  self::td[@class='text']]"/>
        </result>
    </xsl:template>
</xsl:stylesheet>

Output (with a corrected input sample, because yours is missing a closing td):

<result>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
        <a href="/***/***.dll?p=***&amp;sql=xxx:yyy">string</a>
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Gospodar of Lutaka
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            1986
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Sep 1985-Dec 1985
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            Elektra
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;">
            54:51
    </td>
    <div id="title">
        <span>string</span>
    </div>
    <td class="text" style="padding-left:5px;"></td>
</result>
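
For anyone consuming that flat node list from lxml rather than XSLT, a minimal sketch of regrouping it into pairs follows. The `SAMPLE` fragment and its "Artist"/"Year" labels are made up for illustration. The `/html/body/` prefix is needed here only because `etree.HTML` wraps a fragment in `html` and `body` elements. The pairing also assumes each matched table contributes exactly one div and one td, so the nodes alternate.

```python
from lxml import etree

# Hypothetical, simplified markup with two nested name/value tables.
SAMPLE = (
    '<table>'
    '<tr><td><table>'
    '<tr><td><div id="title"><span>Artist</span></div></td></tr>'
    '<tr><td class="text">Gospodar of Lutaka</td></tr>'
    '</table></td></tr>'
    '<tr><td><table>'
    '<tr><td><div id="title"><span>Year</span></div></td></tr>'
    '<tr><td class="text">1986</td></tr>'
    '</table></td></tr>'
    '</table>'
)

doc = etree.HTML(SAMPLE)

# Same expression as above, prefixed with /html/body because of the wrappers.
nodes = doc.xpath(
    '/html/body/table/tr/td/table[tr/td/div[@id="title"]]'
    '[tr/td[@class="text"]]'
    '/tr//*[self::div[@id="title"] or self::td[@class="text"]]'
)

# Flatten each node to its text, then pair even-indexed divs with odd-indexed tds.
texts = ["".join(n.itertext()).strip() for n in nodes]
pairs = list(zip(texts[::2], texts[1::2]))

print(pairs)
```

If a table could hold several text cells, this slicing would misalign, and looping per table would be the safer choice.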

Alejandro 2010-08-12 16:00:13

About the missing </td>: I quickly checked, and there is indeed an odd number of closing </td> tags and an even number of opening <td> tags. That is exactly what the web service produces. I don't want to expose it here, but I can send you the link by e-mail or similar if you want to check my claim.

otrov 2010-08-13 01:54:32

In the end I believe your XPath expression works fine, as you checked it with XSLT, but it does not work when I put it in my code. As an example, I took MattH's snippet and, instead of the for block, I put `node = doc.xpath('/table/tr/td/table[tr/td/div[@id="title"]][tr/td[@class="text"]]/tr//*[self::div[@id="title"] or self::td[@class="text"]]')`, which does not produce a result, similar to Dimitrie's deleted answer. So I probably should have done this with the regex module I already know instead of starting to learn lxml on such an awkward example.

otrov 2010-08-13 01:55:06

@otrov, I've updated my solution with a more specific expression inspired by Alejandro; I hope it serves you better. Honestly, any time spent using XPath instead of regexps for processing HTML or XML is time well spent!

MattH 2010-08-13 06:34:50

@Matt: Your initial code was good for my usage, but I now use the slightly modified version, which serves my code basically the same :) Thanks for the encouragement on XPath. I tried to make sense of XML/XSLT in the past but failed; I guess I need to put more brain cells into this subject. @Alejandro: Thanks for the answer, I'll look more at translating general XPath expressions to lxml XPath expressions :)

otrov 2010-08-13 10:17:28

@otrov: If the `table` element is not your root element (as it is in your posted input sample), then you could add the missing path to the `table` element (such as `/html/body/`, etc.). I do not recommend beginning a path with the `//` operator because it traverses the whole tree.
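
A minimal sketch of that point in lxml, on a made-up fragment (the `SAMPLE` markup and the variable names are assumptions): `etree.HTML` returns an `html` element even when the input starts at `table`, so an absolute path beginning with `/table` selects nothing. `getpath` can report where the table actually sits, and that path can then replace the leading step.

```python
from lxml import etree

# Hypothetical, simplified sample: the table is the root of the fragment.
SAMPLE = (
    '<table><tr><td>'
    '<table>'
    '<tr><td><div id="title"><span>Artist</span></div></td></tr>'
    '<tr><td class="text">Gospodar of Lutaka</td></tr>'
    '</table>'
    '</td></tr></table>'
)

doc = etree.HTML(SAMPLE)  # doc is the <html> element, not <table>

tail = ('table/tr/td/table[tr/td/div[@id="title"]][tr/td[@class="text"]]'
        '/tr//*[self::div[@id="title"] or self::td[@class="text"]]')

# Ask lxml where the outer table really sits in the parsed tree.
outer = next(doc.iter('table'))
print(doc.getroottree().getpath(outer))

from_root = doc.xpath('/' + tail)            # assumes table is the document root
from_body = doc.xpath('/html/body/' + tail)  # accounts for the html/body wrappers

print(len(from_root), [n.tag for n in from_body])
```

On a real page the printed path may include positional steps, which is the place to look when the first `table` under `body` is not the one holding the data.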

Alejandro 2010-08-13 13:11:06

Yeah :) that was embarrassing, I needed to put /table[2]/ instead of /table/. Cheers

otrov 2010-08-13 15:25:28
