This code takes a bit of bad HTML, uses the Tidy library to clean it up, and then passes the result to an HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1, 
                add_xml_decl=1, 
                indent=1, 
                tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

It seems I'm not passing the right type to fromString, because I get this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do differently? Thanks!

A: 

I haven't used the Python tidy module and am not sure where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into an XHTML string.

For a different approach, you could consider using lxml.html, which is decent at parsing broken markup and provides you with a great ElementTree API for working with the result. It can also pretty-print HTML and XML, which makes it sort of a superset of tidy, though it may not cope with badly incoherent markup quite as well.

Also, lxml just wraps a C library (as do the Python tidy modules), so it's much faster than some of the other Python modules for working with XML.
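As a rough sketch of the lxml.html approach, assuming the third-party lxml package is installed, something like the following should repair the markup from the question and pretty-print it. The variable names are only illustrative.

```python
# Assumes the third-party lxml package is installed (pip install lxml).
import lxml.html

broken = "<Html>Bad Html."

# lxml's HTML parser recovers from the missing closing tags
# and returns the root element of the repaired tree.
root = lxml.html.fromstring(broken)

# Serialize the repaired tree back to an indented text string.
cleaned = lxml.html.tostring(root, pretty_print=True, encoding="unicode")

# The result is an ordinary element tree that can be navigated directly.
text = root.text_content()
```

Here cleaned holds the repaired markup as a string, and root can be searched with the usual ElementTree-style methods.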

intuited
A: 

tidy's parseString function returns a _Document instance which implements __str__ but not a buffer interface. Therefore HtmlLib.Reader().fromString cannot create a StringIO object out of it.
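To illustrate the mismatch without needing tidy or PyXML installed, here is a minimal sketch. FakeDocument is a hypothetical stand-in for tidy's _Document, and Python 3's io.StringIO stands in for the cStringIO.StringIO call in the traceback; both are assumptions made for the demonstration, not the libraries' actual code.

```python
import io


class FakeDocument(object):
    """Hypothetical stand-in for tidy's _Document: it can only render itself via __str__."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


def make_stream(source):
    # Mirrors what StrStream does in the traceback: wrap the input in a string buffer.
    return io.StringIO(source)


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object itself fails, because it is not a string.
try:
    make_stream(doc)
    wrapped_directly = True
except TypeError:
    wrapped_directly = False

# Converting with str() first gives the buffer a real string to wrap.
stream = make_stream(str(doc))
content = stream.read()
```

The object knows how to produce a string, but the buffer constructor does not call str() on its argument, so the conversion has to happen explicitly.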

The fix should be fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
AndiDog