views:

540

answers:

6

I've got a fairly simple method which recursively removes begining/ending html tags

class Program
    {   
        static void Main(string[] args)
        {
            string s = FixHtml("<div><p>this is a <strong>test</strong></p></div>");
            Console.WriteLine(s);
        }

        private static string FixHtml(string s)
        {            
            //Remove any outer <div>
            if (s.ToLower().StartsWith("<div>"))
            {
                FixHtml(s.Substring(5, s.Length - 5));
            }
            else if (s.ToLower().StartsWith("<p>"))
            {
                FixHtml(s.Substring(3, s.Length - 3));
            }
            else if (s.ToLower().EndsWith("</div>"))
            {
                FixHtml(s.Substring(0, s.Length - 6));
            }
            else if (s.ToLower().EndsWith("</p>"))
            {
                FixHtml(s.Substring(0, s.Length - 4));
            }

            return s;
        }
    }

The behaviour is that it can recursively remove the <div> & <p> tags, but on the "return s" statement it undo's all the work, by adding back add the tags!

Anyone know why this happens? and how do i force it to return the value i want. i.e
this is a <strong>test</strong>

+13  A: 

In .NET, strings are immutable - so your method actually never changes the return value. When you call s.ToLower().StartsWith("<div>") you get back a new string with the expected differences. The existing string s is left unchanged.

Also, you never consume the return value from your recursive calls.

Off the top of my head, try something like this:

    private static string FixHtml(string s)
    {            
        if (s.ToLower().StartsWith("<div>"))
        {
            return FixHtml(s.Substring(5, s.Length - 5));
        }
        else if (s.ToLower().StartsWith("<p>"))
        {
            return FixHtml(s.Substring(3, s.Length - 3));
        }
        else if (s.ToLower().EndsWith("</div>"))
        {
            return FixHtml(s.Substring(0, s.Length - 6));
        }
        else if (s.ToLower().EndsWith("</p>"))
        {
            return FixHtml(s.Substring(0, s.Length - 4));
        }

        return s;
    }
Bevan
Or at least assign the return string of each FixHtml call back into the variable s.
mkmurray
@mkmurray - yep, that would work. A good alternative for those belonging to the "single exit point" camp.
Bevan
+3  A: 
  • You're not doing anything with the strings returned by the nested calls.
  • When a string is changed, a new object is created, rather than the existed one being changed (they're immutable).
  • If you wanted to take the similar approach without using a return value, you could make the string parameter a 'ref' parameter. Although the performance downfalls mentioned by others would still apply.
frou
+2  A: 

you need to add a return to each FixHtml call like this:

   private static string FixHtml(string s)
    {            
        //Remove any outer <div>
        if (s.ToLower().StartsWith("<div>"))
        {
            return FixHtml(s.Substring(5, s.Length - 5));
        }
        else if (s.ToLower().StartsWith("<p>"))
        {
            return FixHtml(s.Substring(3, s.Length - 3));
        }
        else if (s.ToLower().EndsWith("</div>"))
        {
            return FixHtml(s.Substring(0, s.Length - 6));
        }
        else if (s.ToLower().EndsWith("</p>"))
        {
            return FixHtml(s.Substring(0, s.Length - 4));
        }

        return s;
    }
RCIX
+2  A: 

You need to either use a StringBuilder for this to work or make copies of strings in every call of FixHTML for this to work. This is because Strings are immutable in .NET.

You can look here to see what immutable strings are.

Aamir
+2  A: 

If you're planning to do this on a server, you must use a string builder. The reason is memory performance will be HORRENDOUS if you are using strings. Effectively each time you strip a tag from your string, you effectively copy the string. For each recursion (tag) your system will do this, so if you have even a reasonable size HTML input you will very quickly use a huge amount of memory.

EDIT: In reference to Chris' comment, that previous statement is true if you are dealing with large strings. If you're parsing small chunks of HTML using the string builder isn't as important. But I've made an assumption that you are using this on a server in a web environment, so you could be consuming very large pages with it.

Using a string builder as a reference will also allow your function to manipulate a mutable value so at the end of your recursion the StringBuilder.ToString() will correctly output your mutated string.

You should upvote others who mentioned string mutability as your problem if you upvote my solution please :).

I was trying to answer your problem and fix the next one which believe me is a mistake many have made before.

Also note that your code would die on a <br/>

private static string FixHtml(StringBuilder bldr)    
{                    
    if (String.Compare(blder.ToString(0,5), "<div>", true) == 0)        
    {
        blder.remove(0, 5);            
        return FixHtml(blder);        
    }       
    else if (String.Compare(blder.ToString(0,3), "<p>", true) == 0)       
    {
        blder.remove(0, 3);            
        return FixHtml(blder);             
    }        
    else if (String.Compare(blder.ToString(bldr.Length - 6, 6), "</div>", true) == 0)       
    {
        blder.remove(blder.Length - 6, 6);            
        return FixHtml(blder);                   
    }        
    else if (String.Compare(blder.ToString(bldr.Length - 4, 4), "</p>", true) == 0)       
    {        
        blder.remove(blder.Length - 4, 4);            
        return FixHtml(blder); 
    }       
    return blder.ToString();    
}
Spence
WRONG. http://www.codinghorror.com/blog/archives/001218.html
Chris Charabaruk
Chris, we have a saying where I come from that a little bit of knowledge is dangerous.If he is using this algorithm on reasonably sized chunks of HTML then the function is going to have VERY VERY bad performance, memory pressure multiplied by the number of clients calling it. If you use strings to mutate one function once on a reasonable string, you won't see an improvement.Programming is about compromises between memory, performance, correctness and time taken to construct the result. But I will happily show you test cases that I'm not "WRONG".
Spence
For what its worth im it will be for small chunks of code. I' am wondering why would the code fail on a <br /> ? (not at my dev machine to test)
Look at your algorithm. IF you hit a <br/> you wont catch that and the removal will stop, even if there are more tags in there.You also wont be able to deal with tags that have attributes <div class="something">
Spence
ok i see, well we can add a test for <br> or whatever tags are required. i guess to solve the problem with class and/or ids is to have StartsWith("<div").
+5  A: 

Note that raw text manipulation is usually a poor way to treat xml - for example, you aren't handling attributes, namespaces, trailing tag whitespace (<p >) etc at the moment.

Normally, I'd say load it into a DOM (XmlDocument/XDocument for xhtml; HTML Agaility Pack for html) - but actually I wonder whether an xslt would be good in this case...

For example:

static void Main()
{
    string xhtml = @"<div><p>this is a <strong>test</strong></p></div>";
    XslCompiledTransform xslt = new XslCompiledTransform();
    xslt.Load("strip.xslt");

    StringWriter sw = new StringWriter();
    using(XmlReader xr = XmlReader.Create(new StringReader(xhtml))) {
        xslt.Transform(xr, null, sw);
    }
    string newHtml = sw.ToString();
    Console.WriteLine(newHtml);
}

With strip.xslt:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
  <xsl:template match="strong|@*">
    <xsl:copy><xsl:apply-templates select="*|text()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="*">
    <xsl:apply-templates select="*|text()"/>
  </xsl:template>
</xsl:stylesheet>
Marc Gravell
XSLt sounds like a very appropriate tool for solving this problem. Although they aren't that much fun to write...
Spence