ansaurus

Question

Strip everything but text from HTML

Answer 1

A:

Use XML:

rootNode.innerText

But your input first has to be checked to make sure it is well-formed XML.
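A minimal sketch of this idea in C#, assuming the markup is loaded with System.Xml.XmlDocument (where the property is spelled InnerText). The class and method names are made up for illustration. Ill-formed input, which is common for real-world HTML, is reported by returning null instead of throwing:

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class XmlTextExtractor
{
    // Returns the concatenated text of the document, or null when the
    // input is not well-formed XML.
    public static string TryGetInnerText(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup); // throws XmlException on ill-formed input
        }
        catch (XmlException)
        {
            return null;
        }
        // InnerText concatenates the text of all descendant nodes.
        return doc.DocumentElement.InnerText;
    }
}
```

Calling XmlTextExtractor.TryGetInnerText("<root><b>text</b> more</root>") should give "text more", while an unclosed tag such as "<p>unclosed" should give null.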

Clement Herreman 2009-09-08 13:24:32

Answer 2

A:

You can load it into an XDocument/XElement object and read the Value property, which returns the inner text of the element. You'll have to do that for every element by enumerating the XML/HTML tree depth-first, adding a space between the text of each node.

For example, markup with a single element containing hello will get you "hello".
Markup with two elements that each contain hello will get you "hellohello" using rootNode.innerText. That's why you'll have to read every node separately to get "hello hello".
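The enumeration described above could be sketched like this with LINQ to XML. This is only one possible reading of the answer: it walks the text nodes rather than calling Value on each element, and the class and method names are hypothetical. It assumes the input is well-formed XML.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answer.
public static class XmlTextJoiner
{
    // Collects every non-empty text node in document order and joins
    // the pieces with a single space.
    public static string JoinTextNodes(string xml)
    {
        XDocument doc = XDocument.Parse(xml); // throws XmlException on ill-formed input
        var pieces = doc.DescendantNodes()
                        .OfType<XText>()
                        .Select(t => t.Value.Trim())
                        .Where(s => s.Length > 0);
        return string.Join(" ", pieces);
    }
}
```

For "<div><p>hello</p><p>hello</p></div>", the Value of the root element is "hellohello", whereas XmlTextJoiner.JoinTextNodes should return "hello hello".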

Eran Betzalel 2009-09-08 13:31:07

Answer 3

A:

Use:

var a = new Regex("<[^>]+/?>");
var v = a.Replace("my dirty text here", "");

v will now contain the text without attributes and tags.

Esben Skov Pedersen 2009-09-08 13:31:50

Answer 4

+2 A:

Assuming this is HTML (not XHTML), I would use the HTML Agility Pack to parse it and read InnerText:

using HtmlAgilityPack;

static void Main()
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"
      <p>
      <strong>text text. more text</strong>
      <a href=""http://blabla"">blabla</a> even more text...");
    // InnerText concatenates every text node; whitespace and line breaks are kept.
    string s = doc.DocumentNode.InnerText;
    // s holds "text text. more text" and "blabla even more text...", with the original whitespace around them
}

Marc Gravell 2009-09-08 13:36:27

Answer 5

A:

I've been using regular expressions to strip the HTML from a web page and keep only the text, like this:

Regex.Replace(requestHtml, "<.*?>", string.Empty)
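A hedged sketch of how that one-liner might be wrapped up for real pages. Everything beyond the pattern itself is an assumption added here: RegexOptions.Singleline so that tags spanning several lines are matched, a space as the replacement instead of string.Empty so adjacent words don't run together, WebUtility.HtmlDecode to turn entities such as &amp; back into characters, and a final whitespace collapse. The class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class RegexTagStripper
{
    public static string StripTags(string html)
    {
        // Same pattern as above; Singleline lets '.' match line breaks.
        // Tags are replaced with a space rather than removed outright.
        string noTags = Regex.Replace(html, "<.*?>", " ", RegexOptions.Singleline);
        // Decode entities such as &amp; and &lt;.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

The trade-off of replacing tags with a space is that inline markup inside a word (for example <b>bo</b>ld) comes out as two words. The contents of script and style elements are also left in place, so a parser-based approach is likely safer for arbitrary pages.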

armannvg 2009-09-21 13:27:59
