ansaurus

Question

How to extract text from reasonably sane HTML?

Answer 1

+4 A:

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
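A minimal sketch of that approach, assuming the library in question is the HTML Agility Pack (the comments below mention "the agility pack") and that the text of interest lives in `<p>` elements. The `TextExtractor` class and `ParagraphTexts` method are hypothetical names used only for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package (assumed, see above)

static class TextExtractor
{
    // Hypothetical helper: returns the trimmed text of every non-empty <p> element.
    public static string[] ParagraphTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser, accepts HTML that is not well-formed XML

        return doc.DocumentNode
                  .Descendants("p")                                   // every <p> in the tree
                  .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim()) // text of the element and its children
                  .Where(text => text.Length > 0)                     // skip empty paragraphs
                  .ToArray();
    }
}
```

Swap `"p"` for whatever element name you are after, or use the parameterless `Descendants()` with a `Where` clause to filter on attributes.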

SLaks 2010-01-21 23:08:00

You mean I need to learn LINQ? (Surprisingly, this really is the first thing I've run into where LINQ sounds like the right way to go, but then again, I'm not usually in this domain.)

BCS 2010-01-21 23:15:34

@BCS: You don't _need_ to learn LINQ, but LINQ makes it much easier to use. I would guess that using LINQ effectively would make your code at least 120% shorter, and easier to understand, too.

SLaks 2010-01-21 23:31:47

Wow my code is -20 lines of code! ;)

BCS 2010-01-21 23:59:01

+1 The agility pack is so much better than writing your own DOM processing program.

Ioxp 2010-01-22 16:06:40

As it happens, LINQ wasn't the easiest solution, but only because there is an example project, html2text, that did 90% of what I wanted, and the last 1% was trivial to add as a few lines of `if(...) return;`. (OTOH, the documentation wasn't so good.)

BCS 2010-01-25 05:21:36

Answer 2

A:

It's relatively simple if you load the HTML in C# using mshtml.dll or the WebBrowser control (C#/WinForms). You can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each element.

Alternatively, you can use document.all, which flattens the tree into a single collection that you can iterate over, again capturing the InnerText.

Here's an example:

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds the inner text of every element. An element's InnerText
             * includes the text of its descendants, so nested text is repeated
             * once per ancestor. For example,
             * <html><body><a>test</a><b>test 2</b></body></html> would add:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

Hope that helps!

AlishahNovin 2010-01-21 23:12:49

Googling for mshtml.dll gives mostly a page of bug reports, bug fixes, and errors. Do you have a link to some documentation?

BCS 2010-01-21 23:21:23

I just edited my post with an example using the WebBrowser control.

AlishahNovin 2010-01-22 15:57:08

Answer 3

A:

Here you can download a tool, with its source, that converts between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (which must obviously be much more tolerant than a standard XML parser), and you can traverse the HTML much like XML.

herzmeister der welten 2010-01-22 16:03:15
