In [1]: from lxml import etree

I've got an HTML document:

In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())

Its doctype is parsed correctly:

In [3]: root.getroottree().docinfo.doctype
Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'

But when I serialize the tree, the doctype is lost:

In [4]: etree.tostring(root.getroottree(), method='html')
Out[4]: '<html></html>'

What should I do to get that doctype serialized?

Environment: Debian GNU/Linux sid, Python 2.6.6, lxml 2.2.8-2.

A: 

The only way I've been able to get it to work so far is by using the default XML parser and adding a non-empty system URL to the document:

>>> from StringIO import StringIO
>>> from lxml import etree
>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''))
>>> etree.tostring(html, method="xml")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML/>'
>>> etree.tostring(html, method="html")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'

The same thing using the HTMLParser results in the same docinfo, but not the desired output:

>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''), etree.HTMLParser())
>>> etree.tostring(html, method="html")
'<html></html>'
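
Since the HTML parser does keep the doctype in docinfo, one possible workaround is to prepend it yourself when serializing. This is only a minimal sketch, assuming Python 3 and a recent lxml (the `encoding="unicode"` argument may not work on older setups). `with_doctype` and `serialize_html` are hypothetical helper names, not part of lxml:

```python
def with_doctype(doctype, markup):
    """Prepend `doctype` to `markup` unless it is empty or already there."""
    if not doctype or markup.lstrip().startswith(doctype):
        return markup
    return doctype + "\n" + markup


def serialize_html(tree):
    """Serialize an lxml ElementTree as HTML, keeping the parsed doctype.

    Assumption: `tree` was parsed with etree.HTMLParser(), so that
    tree.docinfo.doctype holds the doctype string (possibly empty).
    """
    # Imported here so that with_doctype() stays usable without lxml.
    from lxml import etree

    markup = etree.tostring(tree, method="html", encoding="unicode")
    return with_doctype(tree.docinfo.doctype, markup)
```

With the tree from the HTMLParser example above, `serialize_html(html)` should start with the doctype string that docinfo reports. The `startswith` check is there so the doctype is not written twice if the serializer already emits it.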
bosmacs
Thanks, but my input is usually not well-formed XML, which is why I use the HTML parser. I filed a bug: https://bugs.launchpad.net/lxml/+bug/659367
liori
No problem, I figured that might be the case.
bosmacs