views:

112

answers:

1

Hello, is there a good parser in C or C++ for extracting tags and links from an HTML page ...

+5  A: 

I'd recommend libxml2. It's written in C and has C++ wrappers and other language bindings. Its HTMLparser module can parse "real world" HTML, including pages that are not well formed: http://www.xmlsoft.org/html/libxml-HTMLparser.html It supports DOM traversal and XPath/XPointer, which let you easily locate the elements you are interested in.

While the API documentation reads like commented header files, there are some introductory pages and code examples on the official site that can help you get started with the library. There are other good tutorials around the web too, e.g. http://student.santarosa.edu/~dturover/?node=libxml2

In my experience, the API provided by libxml2 is not very user-friendly. I had to dive into the source to really understand its design and how best to use it. Fortunately the code is clearly written and well commented.

Arrix