ansaurus

Question

lxml removing <?xml ...> tags when parsing?

Answer 1

A:

Does anyone know why the <?xml ...> element is being removed?

When the declaration is absent, a parser assumes XML version 1.0 and (barring a byte order mark) UTF-8, so the document is equivalent if you remove it.

You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
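To make the round trip concrete, here is a minimal sketch. The sample document is made up for illustration, and the fallback import of the standard library's ElementTree is an assumption added only so the snippet runs where lxml is not installed; the two calls used here take the same arguments in both libraries.

```python
# Prefer lxml; fall back to the standard library's ElementTree
# (an assumption, only so the sketch runs without lxml installed).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical input. It is passed as bytes because lxml rejects
# str input that carries an encoding declaration.
source = b'<?xml version="1.0" encoding="UTF-8"?>\n<root><foo /></root>'

# Parsing builds a tree of elements; the declaration is not a node in it.
root = etree.fromstring(source)

# Serializing writes the tree back out, not the original text,
# so by default no declaration is emitted.
round_trip = etree.tostring(root)

print(round_trip)
```

The printed bytes describe the same tree as the input, but without the prolog, and the empty element may be spelled differently from the source.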

David Dorward 2010-07-12 21:06:21

Is there any way to keep it in there?

Axsuul 2010-07-12 21:07:09

What for? It makes absolutely zero difference to any XML parser.

bobince 2010-07-12 21:12:15

Answer 2

+1 A:

The <?xml> element is an XML declaration, so it's not strictly an element. It just gives info about the XML tree below it.

If you need to print it out with lxml, there is some info here about the xml_declaration=True flag you can pass to the serializer:

http://codespeak.net/lxml/api.html#serialisation

etree.tostring(tree, xml_declaration=True)
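A slightly fuller sketch of that call, under a few assumptions: the sample document is invented, and the fallback to the standard library's ElementTree is there only so the snippet runs without lxml (its tostring() accepts the same two keyword arguments on Python 3.8+). It also shows why passing encoding explicitly can matter: as far as I know, both libraries serialize to ASCII when no encoding is given, and the declaration says so.

```python
# Prefer lxml; fall back to the standard library's ElementTree
# (an assumption, only so the sketch runs without lxml installed).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical document with one non-ASCII character (UTF-8 encoded "é").
root = etree.fromstring(b"<root><name>caf\xc3\xa9</name></root>")

# Declaration requested, encoding left at its default: the output is
# ASCII, with non-ASCII characters written as character references.
ascii_out = etree.tostring(root, xml_declaration=True)

# Declaration requested with an explicit encoding: the declaration
# names UTF-8 and the character is written as raw UTF-8 bytes.
utf8_out = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(ascii_out)
print(utf8_out)
```

Both results are bytes that start with the declaration; only the declared encoding and the representation of the non-ASCII character differ.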

VMDX 2010-07-12 21:08:47

Thanks, this is what I was looking for. Additionally, I had to use `etree.tostring(tree, xml_declaration=True, encoding="utf-8")` to get the encoding I wanted.

Axsuul 2010-07-12 21:19:43

@Axsuul: utf-8 is the default encoding

John Machin 2010-07-12 21:39:59
