I am looking for a C# library that parses HTML (and the CSS specified in it) into a DOM tree that is easy to traverse, similar to this PHP library: http://simplehtmldom.sourceforge.net/

Of course I know I could embed a browser control, but I am looking for something more efficient.

+2  A: 

Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.
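
For a sense of what the resulting tree looks like, here is a minimal sketch that loads an HTML string and walks the element nodes. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`DomWalkSketch`, `DescribeElements`) are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class DomWalkSketch
{
    // Returns one indented line per element, e.g. "div class=banner".
    public static List<string> DescribeElements(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the markup into an in-memory node tree

        var lines = new List<string>();
        Walk(doc.DocumentNode, 0, lines);
        return lines;
    }

    private static void Walk(HtmlNode node, int depth, List<string> lines)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType != HtmlNodeType.Element)
                continue; // skip text and comment nodes

            // GetAttributeValue returns the supplied default when the attribute is absent.
            string cls = child.GetAttributeValue("class", "");
            string suffix = cls.Length > 0 ? " class=" + cls : "";
            lines.Add(new string(' ', depth * 2) + child.Name + suffix);

            Walk(child, depth + 1, lines);
        }
    }
}
```

Calling `DomWalkSketch.DescribeElements("<div class=\"banner\"><p>Hi</p></div>")` should give one entry for the `div` and an indented one for the `p`.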

Scott Dorman
I was actually just looking at it, and I don't see any support for CSS styles. If I have a DIV whose class sets background-image to bg1.gif, does it parse that?
webly
@webly: What would it look like if it did? Do you expect to see CSS properties as DOM elements or attributes, or something different?
Stobor
@webly: No, I don't believe it supports CSS. It will parse the actual HTML, but any CSS-related metadata probably just gets lost. I'm not sure how difficult it would be to modify the code to support CSS but, as Stobor mentioned, how would you expect to see the CSS data?
Scott Dorman
I would expect it to work like the DOM: read those properties and place them in the tree under the element they belong to.
webly
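
One partial workaround, sketched under loud assumptions: an inline `style` attribute still survives parsing as plain attribute text, so its declarations can be split by hand. This does not resolve rules that come from classes or stylesheets (the bg1.gif case above would need a real CSS parser), and the naive split on `;` breaks on values that themselves contain semicolons. The name `InlineStyleSketch` is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class InlineStyleSketch
{
    // Splits "name: value; name: value" text into a property dictionary.
    // Assumption: no value contains ';' (e.g. no data: URLs).
    public static Dictionary<string, string> Parse(string style)
    {
        var props = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (string decl in style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = decl.IndexOf(':'); // split on the first colon only
            if (colon <= 0)
                continue; // not a "name: value" pair

            string name = decl.Substring(0, colon).Trim();
            string value = decl.Substring(colon + 1).Trim();
            if (name.Length > 0)
                props[name] = value; // a later declaration overrides an earlier one
        }

        return props;
    }
}
```

The input would typically be the text of an element's `style` attribute, for example `"background-image: url(bg1.gif); color: red"`.
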
A: 

I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.

TrueWill
A: 

We have used the HTML Agility Pack in our project to extract specific HTML tags with a given set of attributes using XPath, and it has never failed us.
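
A rough sketch of that kind of XPath extraction, assuming the HtmlAgilityPack NuGet package is referenced. The class name, the `external` class value, and the choice of `<a>` elements are illustrative, not taken from our project.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class XPathSketch
{
    // Collects the href of every <a> element that has class="external".
    public static List<string> ExternalLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes can return null rather than an empty collection
        // when nothing matches, so guard before iterating.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[@class='external'][@href]");
        if (nodes == null)
            return hrefs;

        foreach (HtmlNode a in nodes)
            hrefs.Add(a.GetAttributeValue("href", ""));

        return hrefs;
    }
}
```

Note that `@class='external'` matches only when the attribute is exactly that string; elements carrying several classes would need a `contains(...)` test instead.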