I'm currently working on modifying XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the <?xml ...> element. For example:

from lxml import etree

# Pass bytes: on Python 3, lxml rejects a str that carries an encoding declaration
tree = etree.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print(etree.tostring(tree).decode())

will result in

<dmodule>test</dmodule>

Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.

A: 

Does anyone know why the <?xml ...> element is being removed?

The XML declaration is optional. Without it, a parser assumes version 1.0 and, absent other encoding information, UTF-8, so the document is equivalent if you remove the declaration.

You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
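To illustrate the round trip described above, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (my assumption, since the point applies to any parser), and the <root>/<foo> document is made up for the example.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)

# Hypothetical input: a declaration plus an empty element written as <foo></foo>
source = b'<?xml version="1.0" encoding="utf-8"?><root><foo></foo></root>'

root = ET.fromstring(source)   # the parser consumes the declaration
output = ET.tostring(root)     # serialize the in-memory tree back to bytes

print(output)

# The bytes differ from the input, but reparsing gives the same structure.
reparsed = ET.fromstring(output)
same_tree = (reparsed.tag == root.tag
             and [child.tag for child in reparsed] == [child.tag for child in root])
print(same_tree)
```

The serializer only sees the tree, so details such as the prolog and the spelling of empty elements are its choice, not a copy of the input.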

David Dorward
Is there any way to keep it in there?
Axsuul
What for? It makes absolutely zero difference to any XML parser.
bobince
+1  A: 

The <?xml ...?> line is the XML declaration, so it's not strictly an element and it isn't part of the element tree. It just gives info (version, encoding) about the XML document that follows it.

If you need to print it out with lxml, there is some info here about the xml_declaration=True keyword argument you can pass to tostring():

http://codespeak.net/lxml/api.html#serialisation

etree.tostring(tree, xml_declaration=True)
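A slightly fuller sketch of that call, reusing the document from the question. The fallback import to the standard library is my own addition so the snippet still runs where lxml is not installed (it needs Python 3.8+ for xml_declaration there); it is not something lxml requires.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption for portability only: the stdlib API accepts the same calls
    # used below on Python 3.8+.
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
)

# Default serialization: just the element, no declaration.
plain = etree.tostring(root)

# Ask the serializer to write a declaration, naming the encoding explicitly.
declared = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(plain.decode())
print(declared.decode("utf-8"))
```

Note that the declaration is generated by the serializer from these arguments; it is not the original declaration being preserved.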
VMDX
Thanks, this is what I was looking for. Additionally, I had to use `etree.tostring(tree, xml_declaration=True, encoding="utf-8")` to get the encoding I wanted.
Axsuul
@Axsuul: utf-8 is the default encoding
John Machin
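The two defaults are easy to mix up, so here is a rough sketch that separates them. UTF-8 is what a parser assumes for an undeclared document; the output encoding of tostring() is a separate setting, and as far as I know it falls back to ASCII when no encoding is given. The <a> element and its content are made up, and the stdlib fallback import is again my own assumption.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption for portability only; the stdlib behaves alike for this case.
    import xml.etree.ElementTree as etree

# No declaration in the input, so the bytes are read as UTF-8.
root = etree.fromstring("<a>\u00e9</a>".encode("utf-8"))

# No encoding argument: non-ASCII text comes out as a character reference.
default_bytes = etree.tostring(root)

# Explicit UTF-8: the character is written as raw UTF-8 bytes.
utf8_bytes = etree.tostring(root, encoding="utf-8")

print(default_bytes)
print(utf8_bytes)
```

Both outputs parse back to the same tree, so passing encoding="utf-8" only matters if the exact bytes or the declared encoding matter to you.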