ansaurus

Question

Answer 1

+3 A:

libxml is a pretty complex beast, so you should probably read the docs/API reference inside out; they include examples. As with any open source C library, it also helps to look at the source code (at least the header files). Parsing HTML is not an easy task. In fact, it is incredibly difficult, which is one reason it's so hard to make a browser.

apphacker 2009-04-28 22:44:48

Answer 2

+1 A:

Using libxml won't be a problem, assuming all of your documents are valid XHTML, which is an application of XML. If your documents are HTML (which includes "bad" things like the <br> tag), then libxml's XML parser will probably reject them as not well-formed because of the missing closing tags (the same goes for people who don't close <p>).

MighMoS 2009-04-29 01:40:37

The LibXML website says (at http://xmlsoft.org/html/libxml-HTMLparser.html) that it can parse bad HTML.

Chris Lutz 2009-09-27 03:59:39

Answer 3

+1 A:

What programming language do you intend to use? You should run the document through an HTML parser before you try to grok it as XML. If you use something like Tidy or TagSoup to clean it up first, it will make your life much easier.

Trey 2009-04-29 02:01:40

Answer 4

A:

If you aren't bound to C++ and you are familiar with Python, give BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) a try. It is easy to use and yet very powerful.

da8 2009-08-10 10:34:06

Answer 5

+4 A:

I actually did this recently and it is fairly straightforward. This is mostly from memory so it may not be 100% correct:

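#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

// Walk the sibling list starting at element, recursing into children,
// and print the href value of every <a> element found.
void FindLinks(htmlNodePtr element)
{
    for(htmlNodePtr node = element; node != NULL; node = node->next)
    {
        if(node->type == XML_ELEMENT_NODE)
        {
            if(xmlStrcasecmp(node->name, (const xmlChar*)"a") == 0)
            {
                for(xmlAttrPtr attr = node->properties; attr != NULL; attr = attr->next)
                {
                    if(xmlStrcasecmp(attr->name, (const xmlChar*)"href") == 0)
                    {
                        // The URL is the attribute's value, not the link text in
                        // node->children (which may also be NULL).
                        xmlChar* href = xmlNodeListGetString(node->doc, attr->children, 1);
                        if(href != NULL)
                        {
                            printf("Found link <%s>\n", href);
                            xmlFree(href);
                        }
                    }
                }
            }
            if(node->children != NULL)
            {
                FindLinks(node->children);
            }
        }
    }
}

void ParseHTML(xmlChar* html)
{
    // A NULL encoding leaves encoding detection to the parser.
    htmlDocPtr doc = htmlParseDoc(html, NULL);
    if(doc != NULL)
    {
        htmlNodePtr root = xmlDocGetRootElement(doc);
        if(root != NULL)
        {
            FindLinks(root);
        }
        xmlFreeDoc(doc);
        doc = NULL;
    }
}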

Luke 2009-09-27 03:47:55

Answer 6

A:

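LibXML can parse bad HTML. I did this in Ruby by passing the RECOVER option to the initialize method, and the document parsed correctly. I guess for C++ you need to pass the equivalent option when you invoke the parser; libxml2 itself is a C library, so there is no constructor, and the matching HTML parser flag is HTML_PARSE_RECOVER.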

goaasim 2010-10-26 01:13:39


html parsing with libxml
