This is relatively simple if you load the HTML in C#. Using mshtml.dll or the WebBrowser control in WinForms, you can treat the entire HTML document as a tree and traverse it, capturing each element's InnerText.
Alternatively, you can use document.all, which flattens the tree into a single collection that you can iterate over, again capturing the InnerText.
Here's an example:
// Requires: using System; using System.Collections.Generic; using System.Windows.Forms;
WebBrowser webBrowser = new WebBrowser();
// Attach the handler before navigating so the event cannot be missed
webBrowser.DocumentCompleted += delegate
{
    // Document.All is a flat collection of every element in the document
    HtmlElementCollection collection = webBrowser.Document.All;
    List<string> contents = new List<string>();
    /*
     * Adds all inner text of a tag, including inner text of sub-tags
     * e.g. <html><body><a>test</a><b>test 2</b></body></html> would do:
     * "test test 2" when collection[i] == <html>
     * "test test 2" when collection[i] == <body>
     * "test" when collection[i] == <a>
     * "test 2" when collection[i] == <b>
     */
    for (int i = 0; i < collection.Count; i++)
    {
        if (!string.IsNullOrEmpty(collection[i].InnerText))
        {
            contents.Add(collection[i].InnerText);
        }
    }
    /*
     * <html><body><a>test</a><b>test 2</b></body></html>
     * outputs: test test 2|test test 2|test|test 2
     */
    string contentString = string.Join("|", contents.ToArray());
    MessageBox.Show(contentString);
};
webBrowser.Url = new Uri("url_of_file"); // can be remote or local
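The tree-traversal approach could be sketched roughly like this. It is only a sketch: HtmlTextWalker and CollectLeafText are hypothetical names I made up, not part of the WebBrowser API, and it assumes the document has already finished loading. It only collects text from elements that have no child elements, so it avoids the repeated text you get from document.all, but it will miss text that sits directly inside an element that also has child elements.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

static class HtmlTextWalker
{
    // Hypothetical helper: depth-first walk that records the InnerText
    // of leaf elements only (elements with no child elements).
    public static void CollectLeafText(HtmlElement element, List<string> contents)
    {
        if (element == null)
        {
            return; // e.g. Document.Body can be null if nothing was loaded
        }

        if (element.Children.Count == 0)
        {
            // Leaf element: keep its text if there is any
            if (!string.IsNullOrEmpty(element.InnerText))
            {
                contents.Add(element.InnerText);
            }
            return;
        }

        // Not a leaf: recurse into each child element
        foreach (HtmlElement child in element.Children)
        {
            CollectLeafText(child, contents);
        }
    }
}
```

Inside the DocumentCompleted handler you would call something like HtmlTextWalker.CollectLeafText(webBrowser.Document.Body, contents); instead of looping over Document.All. For the <a>/<b> example above, this should give you just the text of the two leaf tags rather than the repeated parent text.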
Hope that helps!