My question is sort of like this question but I have more constraints:

  • I know the documents are reasonably sane
  • they are very regular (they all came from the same source)
  • I want about 99% of the visible text
  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.

+4  A: 

You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
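
For illustration, the Descendants/InnerText approach might look roughly like the sketch below. It is only a sketch: `VisibleText` and `Extract` are hypothetical names made up for this example (not part of the HTML Agility Pack), and it assumes the HtmlAgilityPack package is referenced. It collects the text nodes rather than the InnerText of one big element, so that the contents of script and style tags can be skipped.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical helper (not a library API): joins all text nodes
    // of the document into one space-separated string.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pieces = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text that a browser would never display.
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // Decode entities such as &amp; and drop surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", pieces);
    }

    static void Main()
    {
        Console.WriteLine(Extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Note that this also picks up text in the head (for example the title), so for documents with a real head section you may want to start from the body element instead of DocumentNode.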

SLaks
You mean I need to learn LINQ? (Surprisingly, this really is the first thing I've run into where LINQ sounds like the right way to go, but then again, I'm not usually in this domain.)
BCS
@BCS: You don't _need_ to learn LINQ, but LINQ makes it much easier to use. I would guess that using LINQ effectively would make your code at least 120% shorter, and easier to understand, too.
SLaks
Wow my code is -20 lines of code! ;)
BCS
+1 The agility pack is so much better than writing your own DOM processing program.
Ioxp
As it happens, LINQ wasn't the easiest solution, but only because there is an example project html2text that did 90% of what I wanted and the last 1% was trivial to add as a few lines of `if(...) return;` (OTOH the documentation wasn't so good.)
BCS
A: 

It's relatively simple: load the HTML in C# using mshtml.dll or the WebBrowser control (C#/WinForms). You can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each element.

Or you could use document.all, which flattens the tree into a single collection that you can iterate over, again capturing the InnerText.

Here's an example:

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

Hope that helps!

AlishahNovin
Googling for mshtml.dll gives mostly a page of bug reports, bug fixes and errors. Do you have a link to some documentation?
BCS
I just edited my post with an example using the WebBrowser control.
AlishahNovin
A: 

Here you can download a tool and its source that converts between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (such a thing must obviously be much more tolerant than your standard XML parser), and you can traverse the parsed HTML much like XML.
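
As a rough idea of what traversing it "like XML" could look like, here is a sketch that walks a tree and collects the text nodes. It does not use the converter's parser at all: as a stand-in, it assumes you end up with something like a System.Xml node tree, and builds one with XmlDocument from a small well-formed sample. `TreeText` and `Collect` are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class TreeText
{
    // Depth-first walk that appends every non-empty text node to 'output'.
    public static void Collect(XmlNode node, List<string> output)
    {
        if (node.NodeType == XmlNodeType.Text)
        {
            string text = node.Value.Trim();
            if (text.Length > 0) output.Add(text);
            return;
        }
        foreach (XmlNode child in node.ChildNodes)
            Collect(child, output);
    }

    static void Main()
    {
        // Stand-in for the parser's output (assumption): a well-formed sample.
        var doc = new XmlDocument();
        doc.LoadXml("<html><body><a>test</a><b>test 2</b></body></html>");

        var pieces = new List<string>();
        Collect(doc.DocumentElement, pieces);
        Console.WriteLine(string.Join(" ", pieces)); // prints "test test 2"
    }
}
```

Because only the leaf text nodes are collected, each piece of text shows up once rather than once per enclosing element.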

herzmeister der welten