ansaurus

Question

Answer 1

+3 A:

You could try wxHTML.

codelogic 2009-01-28 21:53:38

Answer 2

+1 A:

Depending on the kind of HTML you're parsing, can you really get by with an XML parser? If so, I've seen nice things said about Xerces, and I've used POCO XML (it's pretty good too, if you don't mind the rest of POCO tagging along).

Hippiehunter 2009-01-28 22:13:02

I wouldn't want to rely on an XML parser for anything other than validated XHTML documents.

jdigital 2009-01-28 23:01:44

Answer 3

+2 A:

~~I don't know what your definition of 'light weight' is, but I really like Qt's DOM Parser.~~

~~For example, to get all images on the page:~~

// DO NOT USE
QDomDocument doc;
doc.setContent( HTML );
QDomNodeList imageElems = doc.elementsByTagName( "img" );
for( unsigned int i = 0; i < imageElems.length(); ++i )
{
    QDomElement e = imageElems.item(i).toElement();
    /* deal with e. Access attributes of img element */
}

~~It works surprisingly well and allows you to use other Qt libraries if you so desire.~~

EDIT: Due to the comment below, I've updated this answer.

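If you're using Qt 4.6, you can use QWebElement. A simple example:

// frame is a QWebFrame*, e.g. the result of QWebPage::mainFrame()
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
// findAll() takes a CSS selector and returns a QWebElementCollection
QWebElementCollection imgs = document.findAll("img");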

Here is another example.

Nick Presta 2009-01-29 03:34:52

Both QDom and TinyXML++ are XML parsers; they don't support HTML, not even valid constructs such as an <img> without a closing </img>.

Nicolás 2010-04-17 01:37:48
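
To make the point about <img> concrete, here is a minimal sketch of the kind of tag-balance rule a strict XML parser enforces. The function name isTagBalanced is made up for this illustration and is not part of QDom, TinyXML++, or any other library; it ignores comments, CDATA, doctypes, and '>' inside attribute values, so treat it as a toy check rather than a real well-formedness test.

```cpp
#include <cctype>
#include <cstddef>
#include <stack>
#include <string>

// Hypothetical helper: returns true if every opening tag has a matching
// closing tag in the correct nesting order, as XML requires.
// Toy check only: comments, CDATA, and doctypes are not handled.
bool isTagBalanced(const std::string& markup) {
    std::stack<std::string> open;  // names of tags still waiting for a close
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;  // unterminated tag
        std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return false;

        bool closing = tag.front() == '/';     // </name>
        bool selfClosing = tag.back() == '/';  // <name ... />

        // The tag name runs up to the first whitespace or '/'.
        std::size_t i = closing ? 1 : 0;
        std::size_t j = i;
        while (j < tag.size() &&
               !std::isspace(static_cast<unsigned char>(tag[j])) &&
               tag[j] != '/') {
            ++j;
        }
        std::string name = tag.substr(i, j - i);
        if (name.empty()) return false;

        if (closing) {
            if (open.empty() || open.top() != name) return false;
            open.pop();
        } else if (!selfClosing) {
            open.push(name);
        }
    }
    return open.empty();
}
```

Under this rule, <p><img src="a.png"></p> is rejected because <img> is never closed, while <p><img src="a.png"/></p> is accepted. An HTML parser has to know that img is a void element and accept both.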

Answer 4

+1 A:

Check out: http://www.codeproject.com/KB/library/GomzyHTMLReader.aspx.

It requires MFC, however.

2009-02-18 16:08:42

MFC is crap. Using it should have been forbidden a long time ago :)

Piotr Dobrogost 2009-05-11 13:36:41

Answer 5

+5 A:

Such an interesting and useful topic, and almost no answers. Really strange...

It's hard to find a good C++ HTML parser (how do all these browsers handle this, then? :). That's the impression I got when I was looking for such a library myself. However, there are good C++ XML parsers (such as Xerces, mentioned by Hippiehunter) which you can use to parse HTML once you have converted it to XML. This conversion can be done using libraries like:

tidyHTML - http://tidy.sourceforge.net (free)
HTML-to-XML - http://www.chilkatsoft.com/html-to-xml-features.asp (commercial)

Pros: Many HTML documents on the internet are not valid HTML at all. If you're going to reuse HTML documents downloaded from an untrusted source (the internet), converting them to valid XML once can save you a lot of time each time you need to use them again.

Cons: Converting HTML to XML requires parsing the HTML anyway, so you could use the converter alone if it gives you a way to read its parse tree. If you're not going to need the same HTML in the future, you can skip the conversion altogether.

Piotr Dobrogost 2009-04-30 15:40:45
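
As a rough illustration of what one step of an HTML-to-XML conversion involves, the sketch below rewrites void elements such as <br> and <img> into the self-closed form that XML parsers expect. It uses only the standard library; the function name closeVoidElements and the list of element names are assumptions made for this example. It does not show how Tidy or the Chilkat library work internally, and it does nothing about unclosed <p> tags, unquoted attributes, scripts, or comments, which a real converter has to handle.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: turns <br> into <br /> (and likewise for other
// void elements) so that an XML parser no longer expects a closing tag.
std::string closeVoidElements(const std::string& html) {
    // Common HTML void elements; this list is an assumption for the sketch.
    static const std::set<std::string> voidNames = {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "param", "source", "track", "wbr"};

    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t lt = html.find('<', pos);
        std::size_t gt = (lt == std::string::npos) ? lt : html.find('>', lt);
        if (gt == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no complete tag left
            break;
        }
        out.append(html, pos, lt - pos);  // text before the tag
        std::string tag = html.substr(lt + 1, gt - lt - 1);

        // Extract the tag name and lowercase it for the lookup.
        std::size_t n = 0;
        while (n < tag.size() &&
               std::isalnum(static_cast<unsigned char>(tag[n]))) {
            ++n;
        }
        std::string name = tag.substr(0, n);
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });

        bool alreadyClosed = !tag.empty() && tag.back() == '/';
        out += '<';
        out += tag;
        if (voidNames.count(name) != 0 && !alreadyClosed) out += " /";
        out += '>';
        pos = gt + 1;
    }
    return out;
}
```

For example, passing <p>Hi<br><IMG src="a.png"></p> through this function yields <p>Hi<br /><IMG src="a.png" /></p>, which a strict XML parser can then read. For real documents, a dedicated converter is the safer choice.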

Answer 6

A:

wxWidgets sucks on Windows 7: it has rendering bugs.

HTMLwiz 2010-08-22 23:20:33

This should be a comment to @codelogic.

Potatoswatter 2010-08-22 23:27:25

Library Recommendation: C++ HTML Parser
