views:

477

answers:

1

Hi,

I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge amount of bugs and fixing them directly in the source is my worst nightmare. Tidy source code is an unreadable abomination. Thousand+ line functions, poor variable naming, spaghetti code etc. It's truly horrible.

Worse yet, official development seems to have ceased. In the last 12 months, there have been three write transactions to the official CVS repo. But it's been dead and buried for much longer than that...

So I'm looking for an OSS C or C++ application/library that can do what Tidy can (when it feels like it): fix bad HTML markup and transform it into valid XHTML (this is the part I'm interested in). And I mean all sorts of bad markup.

Is there something like that out there?

EDIT: I need it both for manipulations on the DOM tree by an XML handling tool and for general compliance with the XHTML spec. My app needs to accept HTML from users (which is often invalid in all sorts of ways) and output valid XHTML. It needs to be able to handle even HTML that would normally not display in a browser because the user edited it by hand and didn't check afterwards.

A drop-in replacement for Tidy's error-correcting parser... that doesn't suck. I don't mind bugs if the source is readable and I can fix problems myself, or if there are active developers who provide bugfixes on a timely basis.

+1  A: 

Could you tell us what you plan to use this tool for? As in, do you want to fix static web pages, or do you want some sort of filtering step before other manipulations, so that some tool can handle buggy web pages?

Personally, I write my own tool atop Python's BeautifulSoup or lxml whenever I need to --- it's at most a dozen line script and does much of what I want.

pavpanchekha
I can't use Python or its libraries. This is a GUI, native code application. Integrating the Python interpreter is not an option.
Lucas
Well, for a GUI native code app, technically integrating the Python interpreter *is* an option, but maybe not an appealing one when you evaluate the pros and cons. http://docs.python.org/extending/embedding.html
Craig McQueen
Then I'd look at native bindings for lxml --- it can do parsing quite well, even for horribly broken html.
pavpanchekha