So I just got my site kicked off the server today and I think this function is the culprit. Can anyone tell me what the problem is? I can't seem to figure it out:

Public Function CleanText(ByVal str As String) As String    
'removes HTML tags and other characters that title tags and descriptions don't like
 If Not String.IsNullOrEmpty(str) Then
  'mini db of extended tags to get rid of
  Dim indexChars() As String = {"<a", "<img", "<input type=""hidden"" name=""tax""", "<input type=""hidden"" name=""handling""", "<span", "<p", "<ul", "<div", "<embed", "<object", "<param"}

  For i As Integer = 0 To indexChars.GetUpperBound(0) 'loop through indexchars array
   Dim indexOfInput As Integer = 0
   Do 'get rid of links
    indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
    If indexOfInput <> -1 Then
     Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
     Dim indexRightBracket As Integer = str.IndexOf(">", indexOfInput) + 1
     'check to make sure a right bracket hasn't been left off a tag
     If indexNextLeftBracket > indexRightBracket Then 'normal case
      str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
     Else
      'add the right bracket right before the next left bracket, just remove everything
      'in the bad tag
      str = str.Insert(indexNextLeftBracket - 1, ">")
      indexRightBracket = str.IndexOf(">", indexOfInput) + 1
      str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
     End If
    End If
   Loop Until indexOfInput = -1
  Next
 End If
 Return str
End Function
A: 

Just a guess, but is this line the culprit? indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar

Per the Microsoft docs, Return Value - The index position of value if that string is found, or -1 if it is not. If value is Empty, the return value is 0.

So perhaps indexOfInput is being set to 0?

JonnyBoats
the first line of the function is: If Not String.IsNullOrEmpty(str), which would take care of that case...
Jason
A: 

What happens if your code tries to clean the string <a?

As I read it, it finds the indexChar at position 0. The search for the next left bracket starts on that same <, so indexNextLeftBracket becomes 1, and with no > anywhere in the string, indexRightBracket becomes -1 + 1 = 0. That lands you in the "normal case" branch, which removes 0 characters starting at position 0, leaving you with <a. Then the code finds the <a in the string again, and you're off to the races with an infinite loop.

Even if I'm wrong, you need to get yourself some unit tests to reassure yourself that these edge cases work properly. That should also help you find the actual looping code if I'm off-base.
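As a starting point for such tests, here is a minimal sketch that runs a single pass of the loop body from the question and reports whether the string got any shorter. The loop body is copied unchanged; the LoopTrace module, the OnePass helper, and the sample strings are illustrative names and inputs, not part of the original code. If a pass makes no progress, Loop Until indexOfInput = -1 can never exit.

```vbnet
Imports System

Module LoopTrace
    ' One pass of the loop body from the question, copied as-is.
    ' OnePass is a hypothetical helper used only for this trace.
    Function OnePass(ByVal str As String, ByVal tagStart As String) As String
        Dim indexOfInput As Integer = str.IndexOf(tagStart)
        If indexOfInput <> -1 Then
            Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
            Dim indexRightBracket As Integer = str.IndexOf(">", indexOfInput) + 1
            If indexNextLeftBracket > indexRightBracket Then
                str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
            Else
                str = str.Insert(indexNextLeftBracket - 1, ">")
                indexRightBracket = str.IndexOf(">", indexOfInput) + 1
                str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
            End If
        End If
        Return str
    End Function

    Sub Main()
        ' Edge cases mentioned in this thread.
        For Each sample As String In New String() {"<a", "<a>Test</a>", "<a<a<a"}
            Dim result As String = OnePass(sample, "<a")
            ' A pass that does not shorten the string means the loop cannot end.
            Console.WriteLine("{0} -> {1} (shorter: {2})", sample, result, result.Length < sample.Length)
        Next
    End Sub
End Module
```

Traced by hand, each of the three samples comes back unchanged after one pass, which is what an endless loop looks like one iteration at a time. A real unit test would assert on the cleaned output instead, but only once the loop is known to terminate.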

Generally speaking though, even if you fix this particular bug, it's never going to be very robust. Parsing HTML is hard, and HTML blacklists are always going to have holes. For instance, if I really want to get a <input type="hidden" name="tax" tag in, I'll just write it as <input name="tax" type="hidden" and your code will ignore it. Your better bet is to get an actual HTML parser involved, and to only allow the (very small) subset of tags that you actually want. Or even better, use some other form of markup, and strip all HTML tags (again using a real HTML parser of some description).

C Pirate
A: 

I'd have to run it through a real compiler, but the mindpiler tells me that the str = str.Remove(indexOfInput, indexRightBracket - indexOfInput) line is regenerating an invalid tag, so that each time you loop through, it finds the same mistake, "fixes" it, tries again, finds the mistake, "fixes" it, and so on.

FWIW, here's a snippet of code that removes unwanted HTML tags from a string (it's in C#, but the concept translates):

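// Requires: using System.Text.RegularExpressions;
public static string RemoveTags( string html, params string[] allowList )
{
    if( html == null ) return null;
    // TagName captures letters and '/', so a closing tag such as </b> is
    // reported as "/b"; list both "b" and "/b" to keep a <b>...</b> pair.
    Regex regex = new Regex( @"(?<Tag><(?<TagName>[a-z/]+)\S*?[^<]*?>)",
                             RegexOptions.Compiled | 
                             RegexOptions.IgnoreCase | 
                             RegexOptions.Multiline );
    // Every matched tag is passed to TagMatchEvaluator.Replace, which decides
    // whether to keep it or drop it.
    return regex.Replace( 
                   html, 
                   new MatchEvaluator( 
                       new TagMatchEvaluator( allowList ).Replace ) );
}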

MatchEvaluator class

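// Nested inside the same class as RemoveTags. Requires: using System.Collections;
private class TagMatchEvaluator
{
    private readonly ArrayList _allowed = null;

    public TagMatchEvaluator( string[] allowList ) 
    { 
        _allowed = new ArrayList( allowList ); 
    }

    // Keeps the tag if its name is on the allow list, otherwise replaces it with "".
    // ArrayList.Contains is case-sensitive even though the regex is not,
    // so <B> is stripped unless "B" itself is listed.
    public string Replace( Match match )
    {
        if( _allowed.Contains( match.Groups[ "TagName" ].Value ) )
            return match.Value;
        return "";
    }
}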
Paul Alexander
mindpiler, heh.
Carson McComas
+5  A: 

Wouldn't something like this be simpler? (OK, I know it's not identical to posted code):

public string StripHTMLTags(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}

(Conversion to VB.NET should be trivial!)

Note: if you are running this often, there are two performance improvements you can make to the Regex.

One is to use a pre-compiled expression which requires re-writing slightly.

The second is to use a non-capturing form of the regular expression; .NET regular expressions implement the (?:) syntax, which allows for grouping to be done without incurring the performance penalty of captured text being remembered as a backreference. Using this syntax, the above regular expression could be changed to:

@"<(?:.|\n)*?>"
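For reference, a minimal VB.NET sketch of the same idea with both improvements applied: the Regex is built once with RegexOptions.Compiled and uses the non-capturing group. The HtmlStripper module and the TagPattern field are illustrative names, and the null/empty guard is an addition of this sketch, not part of the C# above. VB string literals have no backslash escapes, so "\n" reaches the regex engine as-is, just like a C# verbatim string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module HtmlStripper
    ' Built once and reused on every call. RegexOptions.Compiled costs more
    ' at startup in exchange for faster matching afterwards.
    Private ReadOnly TagPattern As New Regex("<(?:.|\n)*?>", RegexOptions.Compiled)

    ' Removes everything that looks like a tag, i.e. "<" up to the nearest ">".
    Public Function StripHtmlTags(ByVal text As String) As String
        ' Guard added for this sketch: Regex.Replace throws on Nothing.
        If String.IsNullOrEmpty(text) Then Return text
        Return TagPattern.Replace(text, String.Empty)
    End Function
End Module
```

Calling StripHtmlTags("<a href=""x"">Test</a>") should give back Test. Like the C# version, this strips every tag, not just the ones on the question's list.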
Mitch Wheat
He's not stripping all tags, or even all tags of a certain type, but yes this is probably much simpler.
Joel Coehoorn
I just noticed that and edited my post and then saw your comment appear!
Mitch Wheat
A: 

That doesn't seem to work for a simplistic <a<a<a case, or even <a>Test</a>. Did you test this at all?

Personally, I hate string parsing like this - so I'm not going to even try figuring out where your error is. It'd require a debugger, and more headache than I'm willing to put in.

Mark Brackett
+1  A: 

In addition to the other good answers, you might read up a little on loop invariants. Removing text from, and inserting text into, the very string that your loop's termination test depends on should set off all manner of alarm bells. :)

JP Alioto
+3  A: 

This line is also wrong:

Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1

It's guaranteed to always set indexNextLeftBracket equal to indexOfInput + 1, because the search starts at indexOfInput and the character at that position is already always a '<'. Do this instead:

Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput+1) + 1

And also add a clause to the if statement to make sure your string is long enough for that expression.
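Putting those two changes together, a rough sketch of the loop might look like the following. This is one possible reading of the fix, not a tested drop-in replacement: the TagCleaner module, the CleanTextFixed name, the removeEnd variable, and the use of StringComparison.Ordinal are all choices made for this sketch. It keeps the question's blacklist and its behavior of removing only the opening tags.

```vbnet
Imports System

Public Module TagCleaner
    Public Function CleanTextFixed(ByVal str As String) As String
        If String.IsNullOrEmpty(str) Then Return str

        ' Same blacklist as in the question.
        Dim indexChars() As String = {"<a", "<img", "<input type=""hidden"" name=""tax""", "<input type=""hidden"" name=""handling""", "<span", "<p", "<ul", "<div", "<embed", "<object", "<param"}

        For Each tagStart As String In indexChars
            Dim indexOfInput As Integer = str.IndexOf(tagStart, StringComparison.Ordinal)
            Do While indexOfInput <> -1
                ' Start one character later so the "<" we are standing on is skipped.
                Dim nextLeft As Integer = str.IndexOf("<"c, indexOfInput + 1)
                Dim rightBracket As Integer = str.IndexOf(">"c, indexOfInput)

                ' End of the span to remove (exclusive). Both IndexOf calls can
                ' return -1, so every combination is handled explicitly.
                Dim removeEnd As Integer
                If rightBracket <> -1 AndAlso (nextLeft = -1 OrElse rightBracket < nextLeft) Then
                    removeEnd = rightBracket + 1 ' normal case: remove through ">"
                ElseIf nextLeft <> -1 Then
                    removeEnd = nextLeft ' ">" left off: remove up to the next "<"
                Else
                    removeEnd = str.Length ' ">" left off and nothing follows
                End If

                ' removeEnd is always greater than indexOfInput, so the string
                ' gets strictly shorter on every pass and the loop must end.
                str = str.Remove(indexOfInput, removeEnd - indexOfInput)
                indexOfInput = str.IndexOf(tagStart, indexOfInput, StringComparison.Ordinal)
            Loop
        Next
        Return str
    End Function
End Module
```

Traced by hand, CleanTextFixed("<a href=""x"">Test</a>") gives Test</a> (closing tags were never on the list, so they stay), and CleanTextFixed("<a<a<a") gives an empty string. Prefix matching is also unchanged from the question, so "<p" still catches tags such as <pre.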

Finally, as others have said, this code will be a beast to maintain, if you can get it working at all. Best to look for another solution, like a regex or even just replacing all '<' with &lt;.

Joel Coehoorn
+1. well spotted!
Mitch Wheat