views:

113

answers:

5

What is the best way to take a string of HTML and turn it into something useful?

Essentially, if I take a URL and fetch the HTML from it in .NET, I get a response, but it comes in the form of a file, a stream, or a string.

What if I want an actual document, or something I can crawl, like an XmlDocument object?

I have some thoughts and an already implemented solution on this but I am interested to see what the community thinks about this.

+6  A: 

HTML pages are rarely valid XML, even when written as XHTML, so they usually cannot be loaded into a standard XML object.

Take a look at the HTML Agility Pack. This .NET component lets you traverse the DOM even when the markup is not valid.
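
As a rough illustration of the idea, here is a minimal sketch that assumes the HtmlAgilityPack package is referenced. The sample markup and the class name AgilityPackSketch are made up for the example; they are not from the question.

```csharp
using System;
using HtmlAgilityPack;

class AgilityPackSketch
{
    static void Main()
    {
        // Hypothetical input: deliberately sloppy markup with unclosed <p> and <li> tags.
        string html = "<html><body><p>First<p>Second"
                    + "<ul><li><a href='/a'>A</a><li><a href='/b'>B</a></ul></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the supplied default if the attribute is missing.
            Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

The XPath query plays the role that SelectNodes would play on an XmlDocument, which is probably the closest thing to "like an XmlDocument" for real-world HTML.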

Jens
That's why I said "like an xml document", as in similar to ... I know this only too well.
Wardy
+2  A: 

I use the mshtml API.

Simply reference the mshtml assembly, then include the namespace.

From there you can declare an HTMLDocument object, which is queryable. It's a bit of a headache in places because the API design forces you to do seemingly random casting, but it does get the job done, and it can always be put into a util class of its own so you don't have to keep the oddities in your main app code classes.
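
A minimal sketch of what this might look like, assuming a reference to the Microsoft.mshtml interop assembly. The sample markup and the class name MshtmlSketch are invented for illustration, and the casts between the IHTMLDocument interfaces are the kind of casting described above.

```csharp
using System;
using mshtml; // assumption: the Microsoft.mshtml interop assembly is referenced

class MshtmlSketch
{
    [STAThread]
    static void Main()
    {
        // Hypothetical input; in practice this would be the HTML string you downloaded.
        string html = "<html><body><a href='http://example.com/'>Example</a></body></html>";

        // HTMLDocument is a COM coclass; its members are spread over several
        // interfaces, so each group of operations needs its own cast.
        var doc2 = (IHTMLDocument2)new HTMLDocument();
        doc2.write(new object[] { html });
        doc2.close();

        // getElementsByTagName lives on IHTMLDocument3, not IHTMLDocument2.
        var doc3 = (IHTMLDocument3)doc2;
        foreach (IHTMLElement anchor in doc3.getElementsByTagName("a"))
        {
            Console.WriteLine(anchor.innerText);
        }
    }
}
```

Wrapping these casts in a small helper class, as suggested above, keeps them out of the calling code.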

+1  A: 
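var browser = new System.Windows.Forms.WebBrowser();
// Navigate is asynchronous: Document is only populated once DocumentCompleted fires.
browser.DocumentCompleted += (s, e) => { var doc = browser.Document; };
browser.Navigate(new System.Uri("http://example.com"));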

HtmlDocument has a number of useful members.

For example, doc.All is an HtmlElementCollection, which can be turned into a generic IEnumerable<HtmlElement> with Cast<HtmlElement>().

HtmlElement.DomElement exposes the underlying object from the mshtml namespace mentioned in another answer.

You can find some usage examples in the source of this project.

abatishchev
Simple ... very simple ... but try this: 1. Create a new console app. 2. Put that code in it. 3. Add a reference to System.Windows.Forms. 4. Run it. This sample seems to break, whereas using the mshtml API doesn't; not sure about the Agility Pack though.
Wardy
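@Wardy: The WebBrowser control doesn't work in a console application out of the box because it's a wrapper around a COM object that has to run on an STA thread with a message loop, and a console app's main thread is MTA by default.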
abatishchev
Exactly. I have code that works as part of a standalone assembly; I simply reference it and use it as required. The best solution is always a nice, clean, portable one :)
Wardy
@Wardy: Hi, Wardy. Have you had any success with your question? :)
abatishchev
I've marked what I believe is the cleanest and most flexible answer. However, I do already have several solutions to this.
Wardy
+1  A: 

The easiest way is to load it into the System.Windows.Forms.HtmlDocument class. You can then access the DOM from there.

Of course you would want to look at the content-type in the HTTP response to determine if this is actually HTML (which the question referred to) or if this is perhaps binary data such as an image.

HTTP basically just returns a raw document, which is either binary data or markup text, and the browser generally does the rest, using the hints it is given in the response headers. This is of course all nicely wrapped in the HttpWebResponse class, ready to use.
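
The content-type check could be sketched roughly as follows. The URL and the class name ContentTypeSketch are placeholders, and the sketch ignores the charset parameter, which a real implementation would probably want to honor.

```csharp
using System;
using System.IO;
using System.Net;

class ContentTypeSketch
{
    static void Main()
    {
        // Placeholder URL for illustration.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com");

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // ContentType may carry parameters, e.g. "text/html; charset=UTF-8",
            // so compare the prefix rather than the whole header value.
            string contentType = response.ContentType ?? "";
            if (!contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine("Not HTML: " + contentType);
                return;
            }

            // Simplification: StreamReader defaults to UTF-8 here instead of
            // using the charset declared by the server.
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Console.WriteLine("Fetched " + html.Length + " characters of HTML.");
            }
        }
    }
}
```

The resulting string can then be handed to whichever parser you prefer.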

Cobusve
+2  A: 

You can use Tidy.net to clean up the HTML you get in your response and convert it to XHTML. You will then be able to load that into an XmlDocument and traverse the nodes to get what you want.

Tidy document = new Tidy();
TidyMessageCollection messageCollection = new TidyMessageCollection();

document.Options.DocType = DocType.Omit;
document.Options.Xhtml = true;
document.Options.CharEncoding = CharEncoding.UTF8;
document.Options.LogicalEmphasis = true;

document.Options.MakeClean = false;
document.Options.QuoteNbsp = false;
document.Options.SmartIndent = false;
document.Options.IndentContent = false;
document.Options.TidyMark = false;

document.Options.DropFontTags = false;
document.Options.QuoteAmpersand = true;
document.Options.DropEmptyParas = true;

MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] array = Encoding.UTF8.GetBytes(xmlResult);
input.Write(array, 0, array.Length);
input.Position = 0;

document.Parse(input, output, messageCollection);

string tidyXhtml = Encoding.UTF8.GetString(output.ToArray());

XmlDocument outputXml = new XmlDocument();
outputXml.LoadXml(tidyXhtml);
skyfoot