+1  Q: 

Parse HTML using C

I've been a big fan of the site, and frankly this is the first time I have come across a problem that Stack Overflow didn't already have an answer to.

I need to grab some content from an HTML page (valid XHTML). I fetch the page using curl and store it in memory. I played with the idea of using regex with the PCRE library, but I simply couldn't find any examples of using it from C. Then I moved on to HTML parsers, and again there is not a good selection. All I could find was a sparsely documented module for libxml called HTMLparser.

Are there any alternatives? If not, are there examples for what I have already found?

+1  A: 

If you want to parse XML using C, then by far the best way to proceed is to use the LibXML library. The main page is at http://xmlsoft.org/. In addition to the downloads, they have explicit code examples that specifically show how to handle parsing. I know for a fact that you can get precompiled versions for Mac and Windows. Most Linux and BSD distributions already include it, and you can build from source if you wish.

Tony Miller
Good choice, but it will choke on broken html, so I'd run it through libtidy first.
Michael Krelin - hacker
+1  A: 

I would use libhtmltidy plus whatever XML parser you prefer, such as expat or libxml. It depends on what you're looking for.

Michael Krelin - hacker
+1  A: 

You want to use HTML Tidy to do this. The libcurl site has some sample source code to get you going; it shows how to traverse the DOM tree. You don't need an XML parser, and it doesn't fail on badly formatted HTML.

http://curl.haxx.se/libcurl/c/htmltidy.html

Byron Whitlock
This is what I ended up implementing. I didn't feel the need to pull out a hungry XML parser just to grab a single line of text. Thanks.
Idfy