ansaurus

Question

Answer 1

A:

I haven't used the Python tidy module and am not sure where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into XHTML.

For a different approach, consider lxml.html, which copes well with broken markup and gives you an ElementTree API for working with the result. It can also pretty-print XML and HTML, which makes it something of a superset of tidy, though it may not recover from badly incoherent markup quite as well.

Also, lxml wraps C libraries (libxml2 and libxslt), much as the Python tidy modules wrap a C library, so it's much faster than some of the other Python modules for working with XML.
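
A minimal sketch of the lxml.html route, assuming lxml is installed. The helper name tidy_with_lxml is made up for illustration, and the output is lxml's HTML serialization, not necessarily the same as what tidy would produce:

```python
import lxml.html


def tidy_with_lxml(markup):
    """Parse possibly broken HTML and return it pretty-printed as text.

    Hypothetical helper, not part of lxml itself.
    """
    # document_fromstring always returns the root <html> element,
    # filling in missing structure around the broken input.
    doc = lxml.html.document_fromstring(markup)
    # encoding="unicode" makes tostring return str instead of bytes.
    return lxml.html.tostring(doc, pretty_print=True, encoding="unicode")


cleaned = tidy_with_lxml("<Html>Bad Html.")
print(cleaned)
```

The parsed doc is a normal ElementTree-style element, so you can walk or modify it before serializing.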

intuited 2010-10-15 09:50:21

Answer 2

+2 A:

tidy's parseString function returns a _Document instance, which implements __str__ but not the buffer interface. HtmlLib.Reader().fromString therefore cannot create a StringIO object from it.

The fix should be fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
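
To illustrate why the str() call matters, here is a self-contained sketch that does not use tidy or HtmlLib at all. FakeDocument is a hypothetical stand-in for tidy's _Document, and Python 3's io.StringIO stands in for whatever StringIO the reader uses internally. It is an analogy under those assumptions, not the libraries' actual code:

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document: it only offers __str__."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


def to_stream(source):
    """Wrap the source in a StringIO, roughly as a reader's fromString might."""
    return io.StringIO(source)


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object itself fails: StringIO wants real text.
try:
    to_stream(doc)
    failed = False
except TypeError:
    failed = True

# Converting with str() first gives StringIO something it can wrap.
content = to_stream(str(doc)).read()
```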

AndiDog 2010-10-15 09:55:22

Python - HTML Parsing with Tidy
