Hi all. I have the following XML document:

<x>
  <a>Some text</a>
  <b>Some text 2</b>
  <c>Some text 3</c>
</x>

I want to get the text of all the tags, so I decided to use getiterator().

My problem is that it prints extra blank lines for a reason I can't understand. Consider this:

>>> for text in document_root.getiterator():
...     print text.text
... 


Some text
Some text 2
Some text 3

Notice the two extra blank lines before 'Some text'. What is the reason for this? If I pass a tag to the getiterator() method, there are no blank lines, as it should be.

>>> for text in document_root.getiterator('a'):
...     print text.text
... 
Some text

So my question is: what causes those extra blank lines when I call getiterator() without a tag, and how do I remove them?

A: 

Although I'm not sure, I would assume it's trying to read the text within <x>.

Anyhow, what's wrong with

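for element in document_root.getiterator():
    # element.text is None when the element has no text at all
    if element.text is None or element.text.strip() == '': continue
    print element.text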
Robus
Aah. I forgot I could use `strip()` too.
sukhbir
It solves my problem but the question of why it happens remains.
sukhbir
Because the element <x> contains text. In this case it's just whitespace, but that's still text.
Bruce van der Kooij
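
To make the explanation in the comment above visible, here is a minimal sketch that prints the repr() of each element's text instead of the text itself. It assumes Python 3 and the standard library's xml.etree.ElementTree rather than whatever parser the question used, and it assumes the <a> element in the question is closed with </a>.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, assuming <a> is closed with </a>.
document_root = ET.fromstring(
    "<x>\n"
    "  <a>Some text</a>\n"
    "  <b>Some text 2</b>\n"
    "  <c>Some text 3</c>\n"
    "</x>"
)

# Collect (tag, text) pairs; iter() visits the root first, then its children.
texts = [(element.tag, element.text) for element in document_root.iter()]

# repr() shows whitespace explicitly instead of printing it as blank lines.
for tag, text in texts:
    print(tag, repr(text))
```

The first pair is the root <x>, whose text is the newline and indentation that precede <a>. Printing that string and then the usual trailing newline is what shows up as two blank-looking lines.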
A: 

By default, lxml.etree treats whitespace-only text between tags as the text content of the enclosing element; in your case, the whitespace being printed comes from <x>. If you want a parser that ignores such whitespace, you can do something like:

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

tree = etree.XML("""\
    <x>
      <a>Some text</a>
      <b>Some text 2</b>
      <c>Some text 3</c>
    </x>
""", parser)

for node in tree.iter():
    if node.text is None: continue
    print node.text

Note that node.text is None if there is no text at all. Also note that the API documentation for lxml states that getiterator() is deprecated in favor of iter().

For more information see The lxml.etree Tutorial: Parser objects.
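
If a custom parser is not an option, the None case and the whitespace-only case can probably be handled in one check. The following is only a sketch under some assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree, the helper name visible_texts is made up for illustration, and the empty <d/> element is a hypothetical addition to show the None case.

```python
import xml.etree.ElementTree as ET

# <d/> is a hypothetical empty element, added only to show the None case.
root = ET.fromstring("<x>\n  <a>Some text</a>\n  <d/>\n</x>")


def visible_texts(element):
    """Return the stripped text of every element that has non-blank text."""
    result = []
    for node in element.iter():
        # node.text is None for <d/>; "or" turns that into an empty string.
        text = (node.text or "").strip()
        if text:
            result.append(text)
    return result


print(visible_texts(root))
```

Keep in mind that strip() also trims leading and trailing whitespace from real text, so this is only suitable when that whitespace does not matter.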

Bruce van der Kooij
Aah thanks. This answers my question perfectly.
sukhbir
Glad to be of service ;-)
Bruce van der Kooij