I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.

If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.

The HTML docs in question are currently loaded for rendering using the System.Windows.Controls.WebBrowser class, so I wonder if it's somehow possible to grab the data from there?

I'm going to keep hunting, but any pointers would be very appreciated.

Note: We don't want the HTML source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.

A: 

This looks like it might be quite helpful.

adam
+1  A: 

If I understand your problem correctly, you will have to do a bit of work to get the data.

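// Needs a reference to Microsoft.mshtml; WebBrowser here is System.Windows.Forms.WebBrowser.
WebBrowser browser = new WebBrowser();  // Stands in for the browser you already have, with the page loaded
HtmlDocument doc = browser.Document;    // This gives you the browser contents (null until a page has loaded)
string content =
    ((mshtml.HTMLDocumentClass)doc.DomDocument).documentElement.innerText;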

That last line is the browser's view of the rendered content.
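
If the control is the WPF System.Windows.Controls.WebBrowser mentioned in the question rather than the Windows Forms one, something along the same lines may work. The following is only a minimal sketch, assuming the project references Microsoft.mshtml and that the page has finished loading. BrowserTextExtractor, GetRenderedText and Attach are hypothetical names made up for illustration, and the sketch reads the body element's text rather than documentElement.

```csharp
using System;
using System.Windows.Controls;

public static class BrowserTextExtractor
{
    // Hypothetical helper: returns the rendered text of the page currently
    // loaded in a WPF WebBrowser, or null if no HTML document is available.
    public static string GetRenderedText(WebBrowser browser)
    {
        // The WPF control exposes its document as a plain object,
        // so cast it to the MSHTML COM interface to reach the DOM.
        var doc = browser.Document as mshtml.IHTMLDocument2;
        return doc?.body?.innerText;
    }

    // Hypothetical wiring: hand the text to a callback each time
    // the control reports that a page has finished loading.
    public static void Attach(WebBrowser browser, Action<string> onText)
    {
        browser.LoadCompleted += (sender, e) => onText(GetRenderedText(browser));
    }
}
```

Reading the text inside LoadCompleted avoids querying the document before it exists. Whether the result matches a Ctrl+A copy exactly is something to check against your own pages.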

Dekker500
This is helpful, the only issue being that we're currently using a Controls.WebBrowser, not a Forms.WebBrowser. However, it seems by far the best bet so far, so I'll see what I can do with it. Thank you :)
Smallgods