Preferably a lightweight HTML parser; I'm not trying to build a browser, and I don't need to emulate JavaScript or handle any HTTP connections.
Depending on the kind of HTML you're parsing, can you get by with just an XML parser? If so, I've seen nice things said about Xerces, and I've used POCO XML (it's pretty good too, if you don't mind the rest of POCO tagging along).
I don't know what your definition of 'light weight' is, but I really like Qt's DOM Parser.
For example, to get all images on the page:
// DO NOT USE: QDomDocument is an XML parser, so setContent() fails on
// HTML that is not well-formed XML.
QDomDocument doc;
doc.setContent( HTML );
QDomNodeList imageElems = doc.elementsByTagName( "img" );
for( unsigned int i = 0; i < imageElems.length(); ++i )
{
    QDomElement e = imageElems.item(i).toElement();
    /* deal with e. Access attributes of img element */
}
It works surprisingly well and allows you to use other Qt libraries if you so desire.
EDIT: Due to the comment below, I've updated this answer.
If you're using Qt 4.6, you can use QWebElement (part of QtWebKit). A simple example:
// frame is a QWebFrame*, e.g. obtained from QWebPage::mainFrame()
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection imgs = document.findAll("img");
Here is another example.
Check out: http://www.codeproject.com/KB/library/GomzyHTMLReader.aspx.
It requires MFC, however.
Such an interesting and useful topic, and almost no answers. Really strange...
It's hard to find a good C++ HTML parser (how do all these browsers handle this, then? :). That's the impression I got when I was looking for such a library myself. However, there are good C++ XML parsers (such as Xerces, mentioned by Hippiehunter) that you can use to parse HTML once you have converted it to XML. This conversion can be done with libraries like:
tidyHTML - http://tidy.sourceforge.net (free)
HTML-to-XML - http://www.chilkatsoft.com/html-to-xml-features.asp (commercial)
Pros: Many HTML documents on the internet are not valid HTML at all. If you're going to reuse HTML documents downloaded from an untrusted source (the internet), converting them to valid XML once can save you a lot of time every time you need to use them again.
Cons: Converting HTML to XML itself requires parsing the HTML, so you could use the converter alone if it gives you a way to read its parse tree. If you're not going to need the same HTML again in the future, you can skip the conversion altogether.
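To make the point about invalid HTML concrete: everyday HTML such as a bare <br> or an unclosed <img> is not well-formed XML, which is why a strict XML parser rejects it until something like Tidy has cleaned it up. Below is a rough sketch of a tag-nesting check using only the standard library. The names isWellNested and demoIsWellNested are made up for illustration, and this is nowhere near a real XML parser: it skips comments and doctypes naively and does not cope with a '>' inside an attribute value.

```cpp
#include <cassert>   // only needed for the usage checks at the bottom
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns true if every opening tag in `markup` has a
// matching closing tag in the right order, as well-formed XML requires.
// Simplifications (assumptions of this sketch): anything starting with
// "<!" or "<?" is skipped, and a '>' inside an attribute value is not handled.
bool isWellNested(const std::string& markup)
{
    std::vector<std::string> open;   // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;     // unterminated tag
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return false;                  // "<>"
        if (tag[0] == '!' || tag[0] == '?') continue;   // doctype, comment, PI

        const bool closing = tag[0] == '/';             // e.g. </p>
        const bool selfClosing = tag.back() == '/';     // e.g. <br/>

        // The element name runs up to the first whitespace or '/'.
        std::size_t i = closing ? 1 : 0;
        std::string name;
        while (i < tag.size() && tag[i] != '/' &&
               !std::isspace(static_cast<unsigned char>(tag[i])))
            name += tag[i++];
        if (name.empty()) return false;

        if (closing) {
            // A closing tag must match the most recently opened element.
            if (open.empty() || open.back() != name) return false;
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    return open.empty();   // anything still open was never closed
}

// Usage sketch: typical HTML fails the check, its XHTML-style form passes.
inline void demoIsWellNested()
{
    assert(!isWellNested("<p>one<br>two</p>"));
    assert(isWellNested("<p>one<br/>two</p>"));
}
```

A check like this only tells you whether an XML parser has a chance; it does not repair anything, which is the job of the converters listed above.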