Duplicate: Looking for C# HTML parser. Please close.

Can you recommend a library for reading HTML files as XML in .NET? I'd prefer to work with XML objects rather than raw text. Ideally, it should also fix HTML formatting errors.

+2  A: 

http://www.codeplex.com/htmlagilitypack

Duplicate? http://stackoverflow.com/questions/100358/looking-for-c-html-parser

BigBlondeViking
Yes, it looks like it is.
Alex Yakunin
Just found SgmlReader as well: http://developer.mindtouch.com/SgmlReader
Alex Yakunin
+1  A: 

You may want to rethink this. HTML and XML are not equivalent.

A good example of this is self-closing tags.

The XML standard specifies that a self-closing tag looks like the following:

<br/>

whereas the HTML standard writes elements that have no content as single tags:

<br>
<link rel="...">

In HTML 4, which is SGML-based, using the XML syntax is technically a violation, because /> has a different meaning there.

There are more examples of these issues in the following article.

Tim Hoolihan
That's precisely the point of the question - he wants a library that would read HTML, with all its quirks, and expose it as well-formed XHTML. So `<br>` gets translated to `<br/>`, implicitly-closed `<p>` becomes explicitly closed, etc.
Pavel Minaev
Exactly. Thanks for the explanation ;)
Alex Yakunin
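
To make the normalization described in the comments above concrete (`<br>` becoming `<br/>`, an implicitly closed `<p>` getting an explicit `</p>`), here is a minimal sketch. It is written in Python with only the standard library, purely to illustrate the idea. It is not how HtmlAgilityPack or SgmlReader work internally, and it is far less thorough than either. The names `XmlifyParser`, `html_to_xml` and `VOID` are hypothetical, and the list of void elements is a small subset chosen for the example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content in HTML (illustrative subset, an assumption).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class XmlifyParser(HTMLParser):
    """Hypothetical helper: re-emits tag soup as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of elements that are still open

    def _attrs(self, attrs):
        # Valueless attributes (e.g. "disabled") reuse their name as the value.
        return "".join(
            f" {k}={quoteattr(v if v is not None else k)}" for k, v in attrs
        )

    def _close_through(self, tag):
        # Close every open element up to and including `tag`.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Simplification: a new <p> closes any <p> that is still open.
        if tag == "p" and "p" in self.stack:
            self._close_through("p")
        if tag in VOID:
            self.out.append(f"<{tag}{self._attrs(attrs)}/>")
        else:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Already written as <tag/> in the input: keep it self-closed.
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are dropped.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and >

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def html_to_xml(html):
    parser = XmlifyParser()
    parser.feed(html)
    return parser.result()


if __name__ == "__main__":
    xml = html_to_xml('<div><p>one<br>two<p>three & four<link rel="x"></div>')
    print(xml)
    # The result can now be loaded by an ordinary XML parser.
    root = ET.fromstring(xml)
    print([p.text for p in root.findall("p")])
```

The sketch assumes the input has a single root element and ignores comments, doctypes and most of the real HTML nesting rules, which is exactly the work a dedicated library takes care of.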