How can I iterate over all tags that have a specific attribute with a specific value? For instance, say we only need data1, data2, etc. from the markup below.

<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, it is not this"> ... </interesting tag>
        <interesting attrib1="yes, this is what we want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>

I tried BeautifulSoup but it can't parse the file. lxml's parser, though, seems to work:

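from io import StringIO

from lxml import etree

# get_sanitized_data() and SITE come from elsewhere in my code (not shown)
broken_html = get_sanitized_data(SITE)

# HTMLParser recovers from malformed markup where it can
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

print(result)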

I am not familiar with its API, and I could not figure out how to use either getiterator or xpath.

A:

Here's one way, using lxml and the XPath 'descendant::*[@attrib1="yes, this is what we want"]'. The XPath tells lxml to look at all the descendants of the current node and return those with an attrib1 attribute equal to "yes, this is what we want".

from io import StringIO  # cStringIO exists only on Python 2

import lxml.html as lh

content = '''
<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, it is not this"> ... </interesting tag>
        <interesting attrib1="yes, this is what we want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''
# Parse the malformed markup; lxml's HTML parser recovers where it can
doc = lh.parse(StringIO(content))
# Select every descendant element whose attrib1 has exactly this value
tags = doc.xpath('descendant::*[@attrib1="yes, this is what we want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" makes tostring return str rather than bytes
    print(lh.tostring(tag, encoding="unicode"))
# <interesting attrib1="yes, this is what we want"><group><line>
#                     data
#                 </line></group><group><line>
#                     data1
#                 <line></line></line></group><group><line>
#                     data2
#                 <line></line></line></group></interesting>
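
If what you are after is the text inside each line element rather than the serialized tag, one possible follow-up is sketched below. It is only a sketch under a few assumptions: the sample markup is repeated so the snippet runs on its own, lh.fromstring is used instead of lh.parse purely for brevity, and the names wanted, texts_iter and texts_xpath are made up for illustration. Element.iter() is the current spelling of getiterator, which lxml documents as deprecated.

```python
import lxml.html as lh

# Same malformed sample as above, repeated so this snippet is self-contained
content = '''
<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, it is not this"> ... </interesting tag>
        <interesting attrib1="yes, this is what we want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''

doc = lh.fromstring(content)
wanted = 'yes, this is what we want'  # illustrative name for the attribute value

# Option 1: find the matching elements, then walk their <line> descendants
texts_iter = []
for tag in doc.xpath('descendant::*[@attrib1=$value]', value=wanted):
    for line in tag.iter('line'):
        text = (line.text or '').strip()
        if text:  # skip <line> elements that carry no text of their own
            texts_iter.append(text)

# Option 2: a single XPath that selects the text nodes directly
raw = doc.xpath('//*[@attrib1=$value]//line/text()', value=wanted)
texts_xpath = [t.strip() for t in raw if t.strip()]

print(texts_iter)
print(texts_xpath)
```

Both options strip whitespace and drop empty strings, because the unclosed line tags in the sample leave extra empty line elements in the repaired tree (visible in the output above). Passing the value through the $value XPath variable avoids having to escape quotes inside the expression.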
unutbu
Thanks, you saved my day!
myle