I am using ElementTree and cannot figure out whether a child node is a text node or not. childelement.text does not seem to work, as it gives false positives even on nodes that are not text nodes.

Any suggestions?

Example

<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>

After parsing this xml file, I do this in Python:

for elem_main in container_trs:  # elem_main is each tr
    elem0 = elem_main.getchildren()[0]  # td[0]
    elem1 = elem_main.getchildren()[1]  # td[1]

    print(elem0.text)
    print(elem1.text)

The above code does not output elem0.text; it is blank. I do see the elem1.text (that is, tttttk) in the output.

Update 2

I am actually building a dictionary that pairs each <tr> with the text from its elements, so that I can sort the HTML table. How would I get the <tr>s in this code?

+1  A: 

How about using the iter method (called getiterator in older Python versions) to iterate through all the descendant nodes:

import xml.etree.ElementTree as xee

content='''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
'''

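def text_content(node):
    # Collect the non-blank text of node and all of its descendants.
    result = []
    # iter() is the current name for getiterator(), which was removed in Python 3.9.
    for elem in node.iter():
        text = elem.text
        if text and text.strip():
            result.append(text)
    return result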

container_trs=xee.fromstring(content)
adict={}
for elem in container_trs:
    adict[elem]=text_content(elem)
print(adict)
# {<Element td at b767e52c>: ['tttttk'], <Element td at b767e58c>: ['tyt for link'], <Element td at b767e36c>: ['something for link']}

The loop for elem_main in container_trs: iterates through the children of container_trs.

In contrast, the loop for elem_main in container_trs.iter(): (spelled getiterator() in older Python versions) iterates through container_trs itself, its children, its grandchildren, and so on.
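
To make the difference concrete, here is a minimal sketch using the same markup as above. It uses iter(), the current name for getiterator(); the variable names direct and descendants are only illustrative.

```python
import xml.etree.ElementTree as ET

content = '''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
'''

tr = ET.fromstring(content)

# Iterating over an element yields only its direct children.
direct = [child.tag for child in tr]

# iter() yields the element itself, then every descendant in document order.
descendants = [node.tag for node in tr.iter()]

print(direct)       # ['td', 'td', 'td']
print(descendants)  # ['tr', 'td', 'a', 'td', 'td', 'a']
```

So a plain loop over a tr never sees the a elements, while iter() does.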

unutbu
I am actually building a dictionary that pairs each `<tr>` with the text from its elements, so that I can sort the HTML table. How would I get the `<tr>`s in this code?
AJ
@AJ: I've changed the code a bit to show how you could grab all the text below each `td` node.
unutbu
Thanks. I will check it tomorrow and let you know.
AJ
thanks a lot !!!
AJ
+1  A: 

elem0.text is None because the text is actually part of the <a> subelement. Just go one level deeper:

print elem0.getchildren()[0].text

By the way, elem0[0].text is a shortcut for that same construct, so there is no need for getchildren() (which is deprecated and was removed in Python 3.9).
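
One possible way to tell whether you need to go a level deeper is to check whether the element has child elements: len(elem) is the number of direct children. Alternatively, itertext() collects the text of an element and everything nested inside it, so the depth stops mattering. A minimal sketch, assuming the markup from the question; cell_text is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

tr = ET.fromstring('''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
''')

def cell_text(elem):
    # Hypothetical helper: join the text of elem and all of its descendants.
    return ''.join(elem.itertext()).strip()

for td in tr:
    # len(td) counts direct child elements; 0 means any text is in td.text itself.
    has_children = len(td) > 0
    print(has_children, repr(td.text), cell_text(td))
```

For the first and third td the check reports child elements and td.text is None, while cell_text still returns the link text.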

ianmclaury
I know this. I just want to know how to check whether I need to go one level deeper?
AJ