Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

+5  A: 

The following will match any matching pair of opening and closing tags, e.g. <b>this</b>:

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

The following will match any single tag, e.g. <b> (it doesn't have to be closed):

Regex tagRegex = new Regex(@"<[^>]+>");

You can then use either one like so:

bool hasTags = tagRegex.IsMatch(myString);
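
As a rough sketch of how the two patterns behave on a few inputs (the class and method names here are made up for illustration, only the regular expressions come from above):

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagCheckDemo
{
    // The two patterns from this answer, unchanged.
    static readonly Regex PairedTag = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");
    static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Hypothetical helper names, not part of any library.
    public static bool HasPairedTag(string s) => PairedTag.IsMatch(s);
    public static bool HasAnyTag(string s) => AnyTag.IsMatch(s);

    public static void Main()
    {
        string[] samples = { "<b>bold</b>", "line<br>break", "1 < 2", "1 < 2 and 3 > 2" };
        foreach (string s in samples)
        {
            Console.WriteLine($"{s} -> paired: {HasPairedTag(s)}, any: {HasAnyTag(s)}");
        }
    }
}
```

Note that the single-tag pattern also flags plain text such as "1 < 2 and 3 > 2", because a < is eventually followed by a >. Whether that false positive matters depends on what your fields are expected to hold.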
ICR
+1  A: 

Here you go:

using System.Text.RegularExpressions;
private bool ContainsHTML(string CheckString)
{
  return Regex.IsMatch(CheckString, "<(.|\n)*?>");
}

That is the simplest way, since text enclosed in angle brackets is unlikely to occur naturally.

Josef
+2  A: 

Angle brackets may not be your only challenge. Other character sequences can also be used for injection attacks, such as the double hyphen "--", which is commonly used in SQL injection. And there are others.

On an ASP.NET page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less scary UI experience.

You could test for both the opening and closing brackets (< and >) using a regular expression, and allow the text if only one of them occurs. That is, allow < or > on its own, but not < followed by some text and then >, in that order.

You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
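
A minimal sketch of that last option, assuming System.Net.WebUtility is available (it exposes the same kind of HtmlEncode/HtmlDecode pair as HttpUtility without needing System.Web); the class and method names are made up for illustration:

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Hypothetical helper: encode user input before it is persisted or rendered.
    public static string ToStoredForm(string userInput) => WebUtility.HtmlEncode(userInput);

    public static void Main()
    {
        string input = "if a < b then <b>shout</b>";
        string stored = ToStoredForm(input);

        // The angle brackets survive as entities instead of being treated as markup.
        Console.WriteLine(stored);

        // Decoding gives back the original text.
        Console.WriteLine(WebUtility.HtmlDecode(stored) == input);
    }
}
```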

DOK
If your strategy for dealing with SQL injection is stripping "--" out of input, you have a bigger problem.
Robert Rossney
Excellent point, Robert, but I didn't think this was the place to launch into a full explanation of defense against SQL injection, or other script injection techniques. My first line of defense against SQL injection is using parameterized SQL. What's yours?
DOK
+5  A: 

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.

In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:

bool bContainsHTML = (myString != HttpUtility.HtmlEncode(myString));
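
To give a feel for how strict this comparison is, here is a small sketch. It swaps in System.Net.WebUtility.HtmlEncode so it runs without a reference to System.Web (an assumption on my part, the idea is the same), and the helper name is made up:

```csharp
using System;
using System.Net;

public static class EncodeCheckDemo
{
    // Hypothetical helper: true if encoding would change the string at all.
    public static bool DiffersWhenEncoded(string s) => s != WebUtility.HtmlEncode(s);

    public static void Main()
    {
        string[] samples = { "plain text", "<b>bold</b>", "Tom & Jerry", "1 < 2" };
        foreach (string s in samples)
        {
            Console.WriteLine($"{s} -> {DiffersWhenEncoded(s)}");
        }
    }
}
```

Only "plain text" passes here. Anything containing &, < or > is flagged even when it is not markup, so this is on the strict end of the scale.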
J c
A: 

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

 public static bool ContainsXHTML(this string input)
 {
  try
  {
   XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
   // Plain text (including an empty string) yields only text nodes, or none at all.
   return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
  }
  catch (XmlException)
  {
   // Anything that is not well-formed XML is treated as markup.
   return true;
  }
 }

One problem I found was that plain text ampersand and less than characters cause an XmlException and indicate that the field contains HTML (which is wrong). To fix this, the input string passed in first needs to have the ampersands and less than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:

 public static string ConvertXHTMLEntities(this string input)
 {
  // Escape every ampersand that is not already part of an &amp; entity
  // (needs using System.Text.RegularExpressions).
  string output = Regex.Replace(input, "&(?!amp;)", "&amp;");

  // Convert less than to the less than entity (without messing up tags).
  output = output.Replace("< ", "&lt; ");

  return output;
 }

Now I can take a user submitted string and check that it doesn't contain HTML using the following code:

bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();

I'm not sure if this is bulletproof, but I think it's good enough for my situation.
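
For anyone who wants to try it, here is a condensed, self-contained sketch of the same approach with a few sample inputs. The class and method names are placeholders I made up for the demo, and the ampersand escaping is written as a single regex:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlCheckDemo
{
    // Escape bare ampersands and "< " so ordinary prose still parses as XML.
    static string Escape(string input)
    {
        string output = Regex.Replace(input, "&(?!amp;)", "&amp;");
        return output.Replace("< ", "&lt; ");
    }

    // True if the wrapped input has any non-text node or is not well-formed XML.
    static bool ContainsMarkup(string input)
    {
        try
        {
            XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
            return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
        }
        catch (XmlException)
        {
            return true;
        }
    }

    public static bool Check(string s) => ContainsMarkup(Escape(s));

    public static void Main()
    {
        string[] samples = { "plain text", "a < b & c", "<b>bold</b>", "a<b" };
        foreach (string s in samples)
        {
            Console.WriteLine($"{s} -> {Check(s)}");
        }
    }
}
```

One thing this shows: "a<b" (a less than sign with no space after it) is still reported as containing HTML, because only "< " gets escaped and the parser then fails on it.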

Ben Mills
You're checking to make sure that it doesn't contain XHTML. You're not checking to make sure that it doesn't contain HTML, which doesn't have to be well-formed XML. Also, your code will not catch "<b></b><b>this is XHTML</b>".
Robert Rossney
Actually, old style HTML that is not well formed XML will cause the XElement.Parse method to fail. My method assumes that the Parse method failing means that the string contains some form of HTML. I guess my code really looks for any form of tags.
Ben Mills