In another thread I was convinced to use an HTML parser instead of regexps for parsing HTML. I thought of using libxml (it has an HTML parser built in), but I couldn't find any useful tutorial. I also found this site, which says libxml should cope even with severely broken HTML.

Could you give me some examples of HTML parsing with libxml, or maybe recommend some different free library for Linux? I'm using C++.

I just thought someone would have some example code, so that I don't have to analyze the headers. ;)

+3  A: 

libxml is a pretty complex beast; you should probably read the docs/API reference inside out. They have examples there. As with any open source C library, you should also look at the source code (at least the header files). Parsing HTML is not an easy task; in fact it is incredibly difficult, which is part of why writing a browser is so hard.

apphacker
+1  A: 

Using libxml won't be a problem, assuming all of your documents are valid XHTML, which is a subset of XML. If your documents are plain HTML (which includes "bad" things like the unclosed <br> tag), then libxml will probably fail to validate and die because of the missing closing tags (the same goes for people who don't close <p>).

MighMoS
The LibXML website says (at http://xmlsoft.org/html/libxml-HTMLparser.html) that it can parse bad HTML.
Chris Lutz
+1  A: 

What programming language do you intend to use? You should run the document through an HTML cleaner before you try to grok it with an XML parser. Using something like Tidy or TagSoup to clean it up first will make your life much easier.

Trey
A: 

If you aren't bound to C++ and if you are familiar with Python give BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) a try. Easy to use and yet very powerful.

da8
+4  A: 

I actually did this recently and it is fairly straightforward. This is mostly from memory so it may not be 100% correct:

void FindLinks(htmlNodePtr element)
{
    for(htmlNodePtr node = element; node != NULL; node = node->next)
    {
        if(node->type == XML_ELEMENT_NODE)
        {
            if(xmlStrcasecmp(node->name, (const xmlChar*)"A") == 0)
            {
                for(xmlAttrPtr attr = node->properties; attr != NULL; attr = attr->next)
                {
                    if(xmlStrcasecmp(attr->name, (const xmlChar*)"HREF") == 0)
                    {
                        /* The attribute's value lives in its child text node;
                           guard against it being absent. */
                        if(attr->children != NULL)
                        {
                            printf("Found link <%s>\n", attr->children->content);
                        }
                    }
                }
            }
            if(node->children != NULL)
            {
                FindLinks(node->children);
            }
        }
    }
}

void ParseHTML(xmlChar* html)
{
    htmlDocPtr doc = htmlParseDoc(html, NULL);
    if(doc != NULL)
    {
        htmlNodePtr root = xmlDocGetRootElement(doc);
        if(root != NULL)
        {
            FindLinks(root);
        }
        xmlFreeDoc(doc);
        doc = NULL;
    }
}
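
As a complete, compilable variant of the same idea (a sketch, not the only way to do it), you can let xmlGetProp fetch the attribute value instead of walking node->properties by hand. The HTML string and flags here are just for illustration; build with `pkg-config --cflags --libs libxml-2.0`:

#include <stdio.h>
#include <libxml/HTMLparser.h>

/* Walk the tree and print the href of every <a> element. */
static void print_links(xmlNodePtr node)
{
    for(; node != NULL; node = node->next)
    {
        if(node->type == XML_ELEMENT_NODE &&
           xmlStrcasecmp(node->name, (const xmlChar*)"a") == 0)
        {
            /* xmlGetProp returns a freshly allocated copy of the value. */
            xmlChar* href = xmlGetProp(node, (const xmlChar*)"href");
            if(href != NULL)
            {
                printf("href: %s\n", href);
                xmlFree(href);
            }
        }
        print_links(node->children);
    }
}

int main(void)
{
    const char* html =
        "<html><body><a href=\"http://example.com\">example</a></body></html>";
    htmlDocPtr doc = htmlReadDoc((const xmlChar*)html, NULL, NULL,
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                 HTML_PARSE_NOWARNING);
    if(doc != NULL)
    {
        print_links(xmlDocGetRootElement(doc));
        xmlFreeDoc(doc);
    }
    xmlCleanupParser();
    return 0;
}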
Luke
A: 

LibXML can parse bad HTML. I did this in Ruby by passing the RECOVER option to the initialize method, and it parsed correctly. I guess for C++ you need to pass the RECOVER option somehow when parsing.
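
At the C level the equivalent is the HTML_PARSE_RECOVER flag in the options argument of htmlReadDoc (a minimal sketch; the broken-HTML string is made up, and NOERROR/NOWARNING just silence the parser's complaints):

#include <stdio.h>
#include <libxml/HTMLparser.h>

int main(void)
{
    /* Deliberately broken HTML: unclosed <p> tags and a bare <br>. */
    const char* html = "<html><body><p>first<br><p>second</body></html>";

    /* HTML_PARSE_RECOVER asks libxml to build a tree even for
       invalid input instead of giving up. */
    htmlDocPtr doc = htmlReadDoc((const xmlChar*)html, NULL, NULL,
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                 HTML_PARSE_NOWARNING);
    if(doc == NULL)
    {
        fprintf(stderr, "parse failed\n");
        return 1;
    }
    printf("parsed OK, root element: %s\n", xmlDocGetRootElement(doc)->name);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}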

goaasim