Hi all! I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider a declaration as follows:

<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>

Now, suppose _course is an Element variable which holds this Course element. I want to access this course's description, so I do:

desc = _course.find("Description").text;

But then desc only contains "Line 1". I read something about the .tail attribute, so I tried also:

desc = _course.find("Description").tail;

And I get the same output. What should I do to make desc be "Line 1
Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages, I guess).
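For context, here is a minimal sketch of how ElementTree appears to split this kind of mixed content, assuming the standard library xml.etree.ElementTree module. The text before the nested <br /> lands in the parent's .text, and the text after it lands in the .tail of the <br /> element itself. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

course = ET.fromstring("""<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>""")

desc = course.find("Description")
br = desc.find("br")

# Text up to the first child element is stored on the parent.
before_br = desc.text

# Text following a child element is stored on that child's .tail.
after_br = br.tail

# itertext() walks the element and yields every text fragment in order.
all_text = "".join(desc.itertext())
```

Note that joining the fragments drops the <br /> itself, so this yields the plain text only, not the original markup.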

+4  A: 

Do you have any control over the creation of the XML file? The contents of XML elements that contain XML tags (or similar) or markup characters ('<', etc.) should be encoded to avoid this problem. You can do this with any of the following:

  • a CDATA section
  • Base64 or some other encoding (which doesn't include xml reserved characters)
  • Entity encoding ('<' == '&lt;')

If you can't make these changes, and ElementTree can't ignore tags not included in the XML schema, then you will have to pre-process the file. Of course, you're out of luck if the schema overlaps with HTML.
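As a rough illustration of the first and third options, the sketch below parses the same description once wrapped in a CDATA section and once with entity encoding, assuming the standard library xml.etree.ElementTree module. The sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

# Option 1: the HTML is wrapped in a CDATA section.
cdata_doc = """<Course>
    <Description><![CDATA[Line 1<br />Line 2]]></Description>
</Course>"""

# Option 3: the markup characters are entity-encoded.
escaped_doc = """<Course>
    <Description>Line 1&lt;br /&gt;Line 2</Description>
</Course>"""

# In both cases the parser sees no child element, so .text holds everything.
cdata_text = ET.fromstring(cdata_doc).find("Description").text
escaped_text = ET.fromstring(escaped_doc).find("Description").text
```

Either way, the <br /> reaches the application as part of the string, and it is up to the caller to interpret it as HTML.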

Dana the Sane
Using a CDATA section solved the problem. Thanks!
Rafael Almeida
+1  A: 

Characters like "<" and "&" are illegal as literal characters in the text of XML elements.

"<" will generate an error because the parser interprets it as the start of a new element.

"&" will generate an error because the parser interprets it as the start of a character entity.
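A small sketch of these errors, assuming Python's standard library: xml.etree.ElementTree rejects the raw characters, while xml.sax.saxutils.escape produces text that parses and round-trips. The parses helper and the sample string are hypothetical, introduced only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(xml_text):
    """Return True if ElementTree accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Script-like text containing both "<" and "&".
raw = "if (a < b && c) { run(); }"

# Embedding the text as-is is not well-formed XML.
raw_ok = parses("<Script>" + raw + "</Script>")

# escape() replaces "&", "<" and ">" with entity references.
escaped_ok = parses("<Script>" + escape(raw) + "</Script>")

# The parser decodes the entities again, giving back the original text.
round_trip = ET.fromstring("<Script>" + escape(raw) + "</Script>").text
```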

Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors, script code can be defined as CDATA.

Everything inside a CDATA section is treated as plain character data; the parser does not interpret it as markup.

A CDATA section starts with "<![CDATA[" and ends with "]]>".

More information on: http://www.w3schools.com/xmL/xml_cdata.asp

Hope this helps!

ylebre