ansaurus

Question

Parse text of element with empty element inside

Answer 1

A:

I don't think the empty tags are your problem. More likely it is the mix of child elements and bare text nodes, which xml.etree does not expose the way you might expect.

BeautifulSoup is great for parsing XML or HTML that isn't well-formed:

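from bs4 import BeautifulSoup, NavigableString  # BeautifulSoup 4; the older BeautifulSoup 3 module is Python 2 only

with open('in.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Emit one <line> per bare text node directly inside the first <td>
print("\n".join("<line>%s</line>" % node.strip() for node in soup.find('td').contents if isinstance(node, NavigableString)))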

Drew Sears 2010-06-02 18:19:51

Answer 2

+1 A:

You need to use the .tail property of the <br> elements.

import xml.etree.ElementTree as et

doc = """<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>
"""

e = et.fromstring(doc)

items = []
for x in e.iter():  # getiterator() was removed in Python 3.9
    if x.text is not None:
        items.append(x.text.strip())
    if x.tail is not None:
        items.append(x.tail.strip())

doc2 = et.Element("lines")
for item in items:
    line = et.SubElement(doc2, "line")
    line.text = item

# encoding="unicode" returns a str rather than bytes on Python 3
print(et.tostring(doc2, encoding="unicode"))
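
As a more compact variant of the loop above, here is a minimal sketch that reads only the direct children of one element. It relies on the fact that ElementTree keeps the text before the first child in the parent's .text and the text after each child in that child's .tail. The helper name text_lines is made up for illustration; it is not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def text_lines(element):
    """Collect the non-empty text fragments directly inside `element`.

    `element.text` holds the text before the first child; each child's
    `.tail` holds the text that follows that child's closing tag.
    (text_lines is a hypothetical helper, not an ElementTree API.)
    """
    fragments = [element.text] + [child.tail for child in element]
    # Skip missing (None) and whitespace-only fragments.
    return [f.strip() for f in fragments if f and f.strip()]


td = ET.fromstring("<TD>\n  Textline1<BR/>\n  Textline2<BR/>\n  Textline3\n</TD>")
print(text_lines(td))
```

Unlike the loop above, this does not descend into nested elements, so it only fits cases like this one where the text sits directly inside the cell.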

EnigmaCurry 2010-06-02 18:35:45

aarrgghh use `if foo is not None:` not `if foo != None`

John Machin 2010-06-02 22:32:01

Of course you're right John, I normally would. I've just spent the last 9 hours coding Java though so I slipped :(

EnigmaCurry 2010-06-02 23:57:21

You must have committed a really serious offence to merit such a sentence as 9 hours Java coding.

John Machin 2010-06-03 01:12:06
