Preferably a lightweight HTML parser. I'm not building a browser, and I don't need to handle JS or any HTTP connections.

+3  A: 

You could try wxHTML.

codelogic
+1  A: 

Depending on the kind of HTML you're parsing, can you get by with an XML parser? If so, I've seen nice things said about Xerces, and I've used POCO XML (it's pretty good too, if you don't mind the rest of POCO tagging along).

Hippiehunter
I wouldn't want to rely on an XML parser for anything other than validated XHTML documents.
jdigital
+2  A: 

I don't know what your definition of 'light weight' is, but I really like Qt's DOM Parser.

For example, to get all images on the page:

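// DO NOT USE for general HTML: QDomDocument is an XML parser (see the comment below).
QDomDocument doc;
doc.setContent( HTML ); // HTML holds the page source; parsing fails if it is not well-formed XML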
QDomNodeList imageElems = doc.elementsByTagName( "img" );
for( unsigned int i = 0; i < imageElems.length(); ++i )
{
    QDomElement e = imageElems.item(i).toElement();
    /* deal with e. Access attributes of img element */
}

It works surprisingly well and allows you to use other Qt libraries if you so desire.

EDIT: Due to the comment below, I've updated this answer.

If you're using Qt 4.6, you can use QWebElement (part of QtWebKit). A simple example:

// frame is a QWebFrame*
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection imgs = document.findAll("img"); // findAll takes a CSS selector

Here is another example.

Nick Presta
Both QDom and TinyXML++ are XML parsers; they don't support HTML, not even valid constructs such as an <img> without a closing </img>.
Nicolás
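
To illustrate the <img> point, here is a minimal sketch, using only the standard library, of the strict open/close matching that XML well-formedness requires. isXmlBalanced is a hypothetical helper written for this illustration; it is not part of QDom or TinyXML++ and it is not a real parser (it assumes no '>' inside attribute values and ignores comments, CDATA and doctypes).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns true only if every start tag is matched by an
// end tag in the right order, or is written in the self-closing form <x/>.
// This is the rule an XML parser enforces and an HTML parser relaxes.
bool isXmlBalanced(const std::string& markup) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t gt = markup.find('>', pos);
        if (gt == std::string::npos) return false;  // unterminated tag

        const bool closing = (pos + 1 < gt && markup[pos + 1] == '/');
        const bool selfClosing = (gt > pos + 1 && markup[gt - 1] == '/');

        // Extract the element name that follows '<' or '</'.
        const std::size_t nameStart = pos + (closing ? 2 : 1);
        std::size_t nameEnd = nameStart;
        while (nameEnd < gt &&
               std::isalnum(static_cast<unsigned char>(markup[nameEnd]))) {
            ++nameEnd;
        }
        const std::string name = markup.substr(nameStart, nameEnd - nameStart);

        if (closing) {
            if (open.empty() || open.back() != name) return false;
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
        pos = gt + 1;
    }
    return open.empty();
}
```

With this check, "<p><img src=\"a.png\"></p>" is rejected because <img> is never closed, while "<p><img src=\"a.png\"/></p>" is accepted.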
+1  A: 

Check out: http://www.codeproject.com/KB/library/GomzyHTMLReader.aspx.

It requires MFC, however.

MFC is crap. Using it should have been forbidden a long time ago :)
Piotr Dobrogost
+5  A: 

Such an interesting and useful topic, and almost no answers. Really strange...

It's hard to find a good C++ HTML parser (how do all these browsers handle this, then? :). That's the impression I got when I was looking for such a library myself. However, there are good C++ XML parsers (Xerces, mentioned by Hippiehunter) which you can use to parse HTML after you convert it to XML first. This conversion can be done using libraries such as tidyHTML - http://tidy.sourceforge.net (free) or HTML-to-XML - http://www.chilkatsoft.com/html-to-xml-features.asp (commercial).

Pros: Many HTML documents on the internet are not valid HTML at all. If you're going to reuse HTML documents downloaded from an untrusted source (the internet), converting them to valid XML once can save you a lot of time every time you need to use them again.

Cons: Converting HTML to XML requires parsing the HTML anyway, so you could use the converter alone if it gives you a way to read its parse tree. If you're not going to need the same HTML again, you can skip the conversion altogether.
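
To give a feel for what the conversion step involves, here is a toy sketch of one small part of it: rewriting HTML void elements into the self-closing form an XML parser accepts. closeVoidElements is a hypothetical function written for this illustration, not the API of tidyHTML or HTML-to-XML; real converters also fix unclosed tags, bad nesting, unquoted attributes, entities and much more. The sketch assumes lowercase tag names and no '>' inside attribute values.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// HTML void elements: they never take a closing tag in HTML.
static const std::set<std::string> kVoidElements = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr"};

// Hypothetical helper: turns <img src="a.png"> into <img src="a.png"/>.
// All other text and tags are copied through unchanged.
std::string closeVoidElements(const std::string& html) {
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        const std::size_t lt = html.find('<', pos);
        const std::size_t gt =
            (lt == std::string::npos) ? lt : html.find('>', lt);
        if (gt == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more complete tags
            break;
        }
        out.append(html, pos, lt - pos);                // text before the tag
        std::string tag = html.substr(lt, gt - lt + 1); // includes '<' and '>'

        // Extract the element name; end tags and comments yield an empty name.
        std::size_t nameEnd = 1;
        while (nameEnd < tag.size() &&
               std::isalnum(static_cast<unsigned char>(tag[nameEnd]))) {
            ++nameEnd;
        }
        const std::string name = tag.substr(1, nameEnd - 1);

        const bool alreadyClosed =
            tag.size() >= 3 && tag[tag.size() - 2] == '/';
        if (kVoidElements.count(name) != 0 && !alreadyClosed) {
            tag.insert(tag.size() - 1, "/");            // <img ...> to <img .../>
        }
        out += tag;
        pos = gt + 1;
    }
    return out;
}
```

The output of such a step could then be handed to an XML parser; tags that are already self-closing, such as <br/>, are left untouched.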

Piotr Dobrogost
A: 

wxWidgets sucks on Win7: it has rendering bugs.

HTMLwiz
This should be a comment to @codelogic.
Potatoswatter