Thanks in advance for your consideration

I need to convert inline CSS style attributes to their HTML tag equivalents. The solution I have works but runs VERY slowly using the Microsoft .NET Regex namespace on long documents (~40 pages of HTML). I've tried several variations but with no useful results. I've done a little wrapping around executing the expressions, but in the end it's just the built-in regex Replace method that gets called.

I'm sure I'm abusing the greediness of the regex but I'm not sure of a way around it to achieve what I want using a single regex.

I want to be able to run the following unit tests:

    [Test]
    public void TestCleanReplacesFontWeightWithB()
    {
        string html = "<font style=\"font-weight:bold\">Bold Text</font>";
        html = Q4.PrWorkflow.Helper.CleanFormatting(html);
        Assert.AreEqual("<b>Bold Text</b>", html);
    }
    [Test]
    public void TestCleanReplacesMultipleAttributesFontWeightWithB()
    {
        string html = "<font style=\"font-weight:bold; color: blue; \">Bold Text</font>";
        html = Q4.PrWorkflow.Helper.CleanFormatting(html);
        Assert.AreEqual("<b>Bold Text</b>", html);
    }
    [Test]
    public void TestCleanReplaceAttributesBoldAndUnderlineWithHtml()
    {
        string html = "<span style=\"font-weight:bold; color: blue; text-decoration: underline; \">Bold Text</span>";
        html = Q4.PrWorkflow.Helper.CleanFormatting(html);
        Assert.AreEqual("<u><b>Bold Text</b></u>", html);
    }
    [Test]
    public void TestCleanReplaceAttributesBoldUnderlineAndItalicWithHtml()
    {
        string html = "<span style=\"font-weight:bold; color: blue; font-style: italic; text-decoration: underline; \">Bold Text</span>";
        html = Q4.PrWorkflow.Helper.CleanFormatting(html);
        Assert.AreEqual("<u><b><i>Bold Text</i></b></u>", html);
    }
    [Test]
    public void TestCleanReplacesFontWeightWithSpaceWithB()
    {
        string html = "<font size=\"10\" style=\"font-weight: bold\">Bold Text</font>";
        html = Q4.PrWorkflow.Helper.CleanFormatting(html);
        Assert.AreEqual("<b>Bold Text</b>", html);
    }

The regular expression I am using to achieve this logic works but is VERY slow. The regex in the C# code looks like this:

    public static IReplacePattern IncludeInlineItalicToITag(ICleanUpHtmlFactory factory)
    {
        return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</\\2>", "$1<i>$3</i></$2>");
    }
    public static IReplacePattern IncludeInlineBoldToBTag(ICleanUpHtmlFactory factory)
    {
        return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?font-weight:\\s*bold[^>]*>)(.*?)</\\2>", "$1<b>$3</b></$2>");
    }
    public static IReplacePattern IncludeInlineUnderlineToUTag(ICleanUpHtmlFactory factory)
    {
        return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?text-decoration:\\s*underline[^>]*>)(.*?)</\\2>", "$1<u>$3</u></$2>");
    }
+2  A: 

First silly question - are you using RegexOptions.Compiled?
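For readers unfamiliar with the option, this is roughly what using it looks like: a minimal sketch, not the asker's code. The class name `CompiledPatternSketch` and the method `HasBold` are made up for illustration. `RegexOptions.Compiled` trades a one-time construction cost for faster matching, so it only pays off when the `Regex` instance is built once and reused.

```csharp
using System.Text.RegularExpressions;

static class CompiledPatternSketch
{
    // Constructed once and reused across calls; rebuilding a compiled
    // Regex on every call would throw away the benefit of the option.
    static readonly Regex BoldStyle = new Regex(
        "font-weight:\\s*bold",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: true if the style text declares a bold weight.
    public static bool HasBold(string style) => BoldStyle.IsMatch(style);
}
```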

Jon Skeet
yes - I'm using the compiled option - but I've tried both on/off to compare results
+7  A: 

I believe the problem is that if it finds a span|font tag that has no style attribute defined, it will continue looking for one until the end of the document because of the ".*?". I haven't tested it, but changing it to "[^>]*?" might improve performance.

EDIT: Make sure you apply that change for all of the ".*?" you have; even the one capturing the content between tags (use "[^<]*?" there), because if the file is not well-formed, it will capture up to the next closing tag.
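As a rough sketch of that change (not benchmarked against the 40-page documents from the question), here is the bold pattern with every ".*?" swapped for a negated character class. The class name `BoundedPatternSketch` and the method names are made up for illustration; the original pattern is copied from the question for comparison.

```csharp
using System.Text.RegularExpressions;

static class BoundedPatternSketch
{
    // Pattern from the question: ".*?" may run far past the end of the tag.
    const string Lazy =
        "(<(span|font) .*?style=\".*?font-weight:\\s*bold[^>]*>)(.*?)</\\2>";

    // Same pattern with negated classes: "[^>]*?" cannot leave the opening
    // tag, and "[^<]*?" stops the content at the next tag.
    const string Bounded =
        "(<(span|font) [^>]*?style=\"[^>]*?font-weight:\\s*bold[^>]*>)([^<]*?)</\\2>";

    const string Replacement = "$1<b>$3</b></$2>";

    public static string WrapBoldLazy(string html) =>
        Regex.Replace(html, Lazy, Replacement);

    public static string WrapBold(string html) =>
        Regex.Replace(html, Bounded, Replacement);
}
```

One trade-off to be aware of: because the content group is now "[^<]*?", an element whose content contains other tags is no longer matched at all, so nested markup would need separate handling.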

Santiago Palladino
Agreed. Any time you're trying to improve on a regex (whether for performance, clarity, or pretty much anything else), .* is the first thing you should look at and try to eliminate.
Dave Sherohman
I totally agree. I partially inherited updating this code so I was trying to do it using the same pattern of using a regex - I probably wouldn't have used a straight regex otherwise.
A: 

Try to use the StringBuilder class in your CleanFormatting routine, instead of the String class. Speeds up string construction quite nicely.

Treb
A: 

.NET regular expressions do not support recursive constructs. PCRE does, but that doesn't matter here.

Consider

    <font style="font-weight: bold;"> text1 <font color="blue"> text2 </font> text3 </font>

It would get converted into

    <b> text1 <font color="blue"> text2 </b> text3 </font>

My suggestion would be to use a proper markup parser, and maybe use regexp on the values of the style-tags.
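A minimal sketch of that idea, assuming the input is well-formed enough to load with LINQ to XML (real-world HTML often is not, and would need a dedicated HTML parser instead). The class `StyleParserSketch`, its `Rules` table and the outermost-first tag order (taken from the unit tests in the question) are illustrative assumptions, not the asker's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class StyleParserSketch
{
    // Style declarations to look for, outermost tag first
    // (the nesting order the question's unit tests expect).
    static readonly (string Property, string Value, string Tag)[] Rules =
    {
        ("text-decoration", "underline", "u"),
        ("font-weight", "bold", "b"),
        ("font-style", "italic", "i"),
    };

    public static string Clean(string fragment)
    {
        // XElement.Parse throws XmlException if the fragment is not well-formed XML.
        XElement root = XElement.Parse(fragment, LoadOptions.PreserveWhitespace);
        return Transform(root).ToString(SaveOptions.DisableFormatting);
    }

    static XNode Transform(XNode node)
    {
        if (!(node is XElement element))
            return node; // text and comments pass through unchanged

        List<XNode> children = element.Nodes().Select(Transform).ToList();
        string name = element.Name.LocalName;
        List<string> tags = (name == "span" || name == "font")
            ? TagsFor((string)element.Attribute("style"))
            : new List<string>();

        // No recognised style: keep the element and its attributes as they are.
        if (tags.Count == 0)
            return new XElement(element.Name, element.Attributes(), children);

        // Replace the element with nested tags, built from the innermost outwards.
        XElement result = new XElement(tags[tags.Count - 1], children);
        for (int i = tags.Count - 2; i >= 0; i--)
            result = new XElement(tags[i], result);
        return result;
    }

    static List<string> TagsFor(string style)
    {
        // Split "a: b; c: d" into property/value pairs; malformed pieces are skipped.
        var declared = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string declaration in (style ?? "").Split(';'))
        {
            string[] parts = declaration.Split(new[] { ':' }, 2);
            if (parts.Length == 2)
                declared[parts[0].Trim()] = parts[1].Trim();
        }
        return Rules
            .Where(r => declared.TryGetValue(r.Property, out string v)
                        && string.Equals(v, r.Value, StringComparison.OrdinalIgnoreCase))
            .Select(r => r.Tag)
            .ToList();
    }
}
```

Because the tree is rebuilt element by element, the nested font example above keeps its inner tag properly closed instead of ending up with mismatched tags.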

Edit: Scratch that. It seems .NET does have a construct for balanced (nested) patterns, though it is not as powerful as recursion in PCRE/Perl.

(?<N>content) would push N onto a stack if content matches
(?<-N>content) would pop N from the stack, if content matches.
(?(N)yes|no) would match "yes" if N is on the stack, otherwise "no".

See http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx for details.
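To make those constructs concrete, here is a small sketch that uses a balancing group to find the closing tag that really belongs to an outer font element. The group name `depth` and the class `BalancedFontSketch` are arbitrary choices for illustration, and the lazy pattern is included only for comparison.

```csharp
using System.Text.RegularExpressions;

static class BalancedFontSketch
{
    // Lazy version: stops at the first closing tag, even a nested one.
    static readonly Regex Lazy = new Regex(
        @"<font\b[^>]*>.*?</font>", RegexOptions.Singleline);

    // Balanced version: counts nested opening and closing tags on a stack.
    static readonly Regex Balanced = new Regex(@"
        <font\b[^>]*>                  # outermost opening tag
        (?:
            <font\b[^>]*> (?<depth>)   # nested opening tag: push
          | </font> (?<-depth>)        # close of a nested tag: pop (fails if stack is empty)
          | (?!</?font\b) .            # any character that does not start a font tag
        )*
        (?(depth)(?!))                 # fail if a nested tag is still open
        </font>                        # closing tag of the outermost element",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    public static string FirstLazy(string html) => Lazy.Match(html).Value;

    public static string FirstBalanced(string html) => Balanced.Match(html).Value;
}
```

On the nested example above, the lazy pattern stops after "text2 </font>", while the balanced one runs to the final closing tag.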

MizardX
It is perfectly possible to handle that in .NET. See balancing groups in regexes; they allow you to match balanced parentheses, for instance. http://oreilly.com/catalog/regex2/chapter/index.html
Santiago Palladino
A: 

Wild guess: I believe the cost comes from the alternation and the corresponding backreference. You might want to try to replace:

"(<(span|font) .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</\\2>", "$1<i>$3</i></$2>"

with two separate expressions:

"(<span .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</span>", "$1<i>$2</i></span>"
"(<font .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</font>", "$1<i>$2</i></font>"

Granted, that doubles the parsing of the file, but with simpler regexes and less backtracking, it might be faster in practice. It is not very nice (repetition of code), but as long as it works...

Funnily, I did something similar (I don't have the code at hand) to clean up HTML generated by a tool, simplifying it so that JavaHelp can understand it... It is one case where using regexes against HTML is OK, because the HTML is created not by a human making mistakes or changing little things, but by a process with well-defined patterns.
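Applied in sequence, the two expressions might be wired up like this. It is only a sketch: the class `SplitPatternSketch` and the method `WrapItalic` are invented names, and whether this is actually faster would have to be measured on the real documents.

```csharp
using System.Text.RegularExpressions;

static class SplitPatternSketch
{
    // One pattern per tag name, so neither needs the alternation or the \2 backreference.
    static readonly (string Pattern, string Replacement)[] ItalicRules =
    {
        ("(<span .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</span>", "$1<i>$2</i></span>"),
        ("(<font .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</font>", "$1<i>$2</i></font>"),
    };

    public static string WrapItalic(string html)
    {
        // Each rule makes its own pass over the whole document.
        foreach (var rule in ItalicRules)
            html = Regex.Replace(html, rule.Pattern, rule.Replacement);
        return html;
    }
}
```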

PhiLho
A: 

I have a similar problem when using a regexp to split a SQL script into sections from GO to GO. My regexp is: @"([\s\S]?)^GO(\s|$)" and I'm extracting match group 1 from it. It runs very slowly on large sections because of the "?". I know that something like @"([^GO]+)(\s|$)" would be faster, but it doesn't match the string up to GO, only up to the first G or O. I do not know how to correct or improve it.
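One possible way around the "[^GO]" limitation is to stop capturing the sections and instead split on lines that contain nothing but GO. This is a different technique from the pattern above, sketched under the assumption that GO always sits on its own line; the class `SqlBatchSketch` and the method `SplitBatches` are made-up names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class SqlBatchSketch
{
    // A line consisting only of GO, with optional spaces or tabs around it.
    // "\r?" is there because "$" in Multiline mode matches before "\n" only.
    static readonly Regex GoLine = new Regex(
        @"^[ \t]*GO[ \t]*\r?$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    public static string[] SplitBatches(string script) =>
        GoLine.Split(script)
              .Select(batch => batch.Trim())
              .Where(batch => batch.Length > 0)   // drop empty pieces between separators
              .ToArray();
}
```

Since the pattern only ever looks at a single line, there is no long lazy scan to backtrack over, and a line such as "GOTO label" is not mistaken for a separator.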

Pa0l0
A: 

During testing I found some strange behavior: when the regexp runs in a separate thread, it runs a lot faster. I have a SQL script that I was splitting into sections from GO to GO using a regexp. Without a separate thread, it takes about 2 minutes; with multithreading it takes only a few seconds.

Pa0l0