I have an RSS feed that I want to modify on the fly. All I need is the text (and line feeds), so everything else must be removed (all images, styles, and links).

How can I do this easily with ASP.NET and C#?

A: 
string pattern = @"<(.|\n)*?>";
return Regex.Replace(htmlString, pattern, string.Empty);
Tom
Fails for attributes with ‘>’ in value, comments, PIs, etc.
bobince
A: 

Be careful - don't assume that the HTML you receive is well formed:

public static string ClearHTMLTagsFromString(string htmlString)
{
    string regEx = @"\<[^\<\>]*\>";
    string tagless = Regex.Replace(htmlString, regEx, string.Empty);

    // remove rogue leftovers
    tagless = tagless.Replace("<", string.Empty).Replace(">", string.Empty);

    return tagless;
}
teedyay
A: 

I did this in JavaScript for a project in much the same way as above:

var thisText = document.getElementById('textToStrip').value;
// Only the global flag matters here: the pattern has no letters or anchors.
var re = new RegExp('<(.|\\n)*?>', 'g');
thisText = thisText.replace(re, '');
Paul Herzberg
+4  A: 

Regex cannot parse XML. Do not use regex to parse XML. Do not pass Go. Do not collect £200.

You need a proper XML parser. Load the RSS into an XmlDocument, then use InnerText to get only the text content.
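As a minimal sketch of that approach, assuming an RSS 2.0 feed whose elements carry no XML namespace: the class name RssText, the method name GetItemDescriptions, and the XPath "//item/description" are illustrative choices, not part of the original answer. A namespaced feed (RSS 1.0/RDF, for example) would need an XmlNamespaceManager instead.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RssText
{
    // Returns the text content of every <description> element inside an <item>.
    // Assumes RSS 2.0, where item and description are not namespaced.
    public static List<string> GetItemDescriptions(string rssXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(rssXml); // throws XmlException if the feed is not well-formed XML

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//item/description"))
        {
            // InnerText concatenates the node's text with XML entities already
            // decoded by the parser; surrounding whitespace is kept as-is.
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

This only undoes the XML layer. If the description held entity-encoded HTML, the returned string still contains that HTML as literal text.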

Note that even when you've extracted the description content from RSS, it can contain active HTML. That is:

<description> &lt;em&gt;Fish&lt;/em&gt; &amp;amp; chips </description>

can, when parsed properly as XML and then read as text, give you either the literal string:

<em>Fish</em> &amp; chips

or the markup, which renders as:

Fish & chips (with "Fish" emphasized)

The fun thing about RSS is that you don't really know which is right. In RSS 2.0 it is explicitly HTML markup (the second case); in other versions it's not specified. Generally you should assume that descriptions can contain entity-encoded HTML tags, and if you want to further strip those from the final text you'll need a second parsing step.

(Unfortunately, since this is legacy HTML and not XML it's harder to parse; a regex will be even more useless than it is for parsing XML. There isn't a built-in HTML parser in .NET, but there are third-party libraries such as the HTML Agility Pack.)
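A rough sketch of that second parsing step using the HTML Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced: the class name HtmlText and the method name ToPlainText are made up for illustration. It takes a string that has already been extracted from the XML and drops the HTML tags from it.

```csharp
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

public static class HtmlText
{
    // Strips tags from an HTML fragment and decodes its character entities.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed markup

        // InnerText drops the tags but leaves entities such as &amp; encoded,
        // so they are decoded in a separate step.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

This sketch makes no attempt to preserve line breaks implied by tags such as br or p; if those matter, the document nodes would have to be walked and newlines emitted explicitly.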

bobince
+1 The only answer that makes sense so far and it gets downvoted ...
unbeknown
The regex culture war rages on!
bobince