Hi

Our CMS lets users enter text using an HTML editor, so when I read the text into the web page I can get something like this:

&#xD;&#xA;   <p>&#xD;&#xA;      <strong>text text. more 
text</strong>&#xD;&#xA;      <a href="http://blabla&gt;blabla&lt;/a&gt; even more text...

How can I strip everything except the text, while keeping punctuation such as , and . and similar characters?

A: 

Use XML:

rootNode.InnerText

However, your input must first be checked to be well-formed XML.
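
A minimal sketch of this idea, assuming the standard System.Xml.XmlDocument class is used for parsing. The class name XmlTextDemo, the method GetInnerText and the sample markup are made up for illustration and are not from the original answer.

```csharp
using System;
using System.Xml;

static class XmlTextDemo
{
    // Returns the concatenated text of every node under the root element.
    // LoadXml throws XmlException if the input is not well-formed XML.
    public static string GetInnerText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        // Hypothetical, well-formed sample input.
        string xml = "<p><strong>text text.</strong> more text</p>";
        Console.WriteLine(GetInnerText(xml)); // text text. more text
    }
}
```

Input that is not well-formed, such as the fragment in the question with its unterminated <a> tag, makes LoadXml throw an XmlException, so this only works once the markup has been cleaned up.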

Clement Herreman
A: 

You can load it into an XDocument/XElement object and read the Value property, which returns the inner text of the element. You'll have to do that for every element, using a depth-first enumeration of the XML/HTML tree, and add spaces between the inner text nodes.

  • <P>hello</P> will get you "hello"
  • <P>hello</P><P>hello</P> will get you "hellohello" using rootNode.innerText - that's why you'll have to use it for every node to get "hello hello".
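
The per-node enumeration described above could look roughly like this. It is only a sketch: the <root> wrapper element (needed because XML allows a single root), the class name XElementTextDemo and the method GetSpacedText are assumptions added for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XElementTextDemo
{
    // Collects every text node in document order and joins them with one space.
    public static string GetSpacedText(string xml)
    {
        XElement root = XElement.Parse(xml);
        var parts = root.DescendantNodes()
                        .OfType<XText>()
                        .Select(t => t.Value.Trim())
                        .Where(s => s.Length > 0);
        return string.Join(" ", parts);
    }

    static void Main()
    {
        string xml = "<root><P>hello</P><P>hello</P></root>";
        Console.WriteLine(XElement.Parse(xml).Value); // hellohello
        Console.WriteLine(GetSpacedText(xml));        // hello hello
    }
}
```
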
Eran Betzalel
A: 

Use

var a = new Regex("<[^>]+/?>");
var v = a.Replace("my dirty text here", "");

v will now contain the text without attributes and tags.

Esben Skov Pedersen
+2  A: 

Assuming this is HTML (not XHTML), I would use the HTML Agility Pack to parse it and access InnerText:

using HtmlAgilityPack;

static void Main()
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"&#xD;&#xA;      <p>&#xD;&#xA;      <strong>text text. more text</strong>&#xD;&#xA;      <a href=""http://blabla&gt;blabla&lt;/a&gt; even more text...");
    string s = doc.DocumentNode.InnerText;
    // s is: &#xD;&#xA;      &#xD;&#xA;      text text. more text&#xD;&#xA;     
}
Marc Gravell
A: 

I've been using regular expressions to strip the HTML from a web page and keep only the text itself, like this:

Regex.Replace(requestHtml, "<.*?>", string.Empty)
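
A regex replace on its own leaves character entities such as &#xD;&#xA; in the result. The sketch below is one possible way to round it off, not part of the original answer: it adds RegexOptions.Singleline so tags spanning several lines are matched, replaces each tag with a space instead of string.Empty so adjacent elements do not run together, decodes entities with System.Net.WebUtility.HtmlDecode, and collapses whitespace. The names RegexStripDemo and StripTags are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class RegexStripDemo
{
    public static string StripTags(string html)
    {
        // Singleline lets '.' match newlines, so tags that span lines are removed too.
        string noTags = Regex.Replace(html, "<.*?>", " ", RegexOptions.Singleline);
        // Turn entities such as &#xD;&#xA; or &amp; back into characters.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of whitespace (including the decoded CR/LF) into one space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string html = "&#xD;&#xA;   <p>&#xD;&#xA;      <strong>text text. more text</strong>&#xD;&#xA;   </p>";
        Console.WriteLine(StripTags(html)); // text text. more text
    }
}
```

Replacing tags with a space can leave a gap before punctuation that directly follows a closing tag. A fragment that never reaches a closing > (like the unterminated <a href= in the question) is not matched by the pattern and stays in the output.
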
armannvg