Hey guys,

I have a Regex-based whitespace filter on an ASP.NET MVC application, and it works perfectly; too perfectly, in fact. One of the things that gets filtered out is the \r\n line breaks. This effectively puts the whole page source on one line, which I love because I don't have to deal with quirky CSS caused by whitespace, but in certain instances I need to retain them. One example is when I want to literally display text with line breaks in it, such as a note.

To do so, I would obviously wrap it in <pre></pre> tags, but because of the filter the line breaks in the text between the tags also get scrubbed, which makes a note, for example, rather difficult to read.

Can anyone with Regex knowledge (mine is very poor...) help me modify the current Regex to ignore text between the <pre> tags?

Here's the current code:

using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;

public class WhitespaceFilter : MemoryStream {
    private string Source = string.Empty;
    private Stream Filter = null;

    public WhitespaceFilter(HttpResponseBase HttpResponseBase) {
        Filter = HttpResponseBase.Filter;
    }

    public override void Write(byte[] buffer, int offset, int count) {
        // Decode only the bytes that belong to this write, not the whole buffer.
        Source = Encoding.UTF8.GetString(buffer, offset, count);

        // Strip tabs and line breaks.
        Source = new Regex("\\t", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);
        Source = new Regex(">\\r\\n<", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, "><");
        Source = new Regex("\\r\\n", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);

        // Remove runs of double spaces until none are left.
        while (new Regex("  ", RegexOptions.Compiled | RegexOptions.Multiline).IsMatch(Source)) {
            Source = new Regex("  ", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);
        }

        // Remove single whitespace characters between tags, then HTML comments.
        Source = new Regex(">\\s<", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, "><");
        Source = new Regex("<!--.*?-->", RegexOptions.Compiled | RegexOptions.Singleline).Replace(Source, string.Empty);

        // The filtered bytes live in a new array, so write it from index 0.
        byte[] output = Encoding.UTF8.GetBytes(Source);
        Filter.Write(output, 0, output.Length);
    }
}

Thanks in advance!

+2  A: 

There are tools like htmlcompressor already out there to strip whitespace. And like exhuma said, if this is for web optimization then gzip compression would help more than anything if you configured it on the web server.

As for your original question, there are a lot of different ways to do this. You could also attack the problem with something like XPath (if the HTML is valid XHTML) and then combine that with regex. But I figured I'd try my hand at writing a single regex to do it:

(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]

It seems to work for me. Fortunately .NET has an extremely powerful regex engine including a very cool balanced matching feature. I can't explain it any better than Ryan Byington can. But the idea is to match the beginning and ending pre tags first and make sure everything inside is untouched. Then everything around those pre tags gets the rest of the regex applied, "[\n\r]".

To make this work you'd simply do this:

Source = new Regex("(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]", RegexOptions.Compiled | RegexOptions.Singleline).Replace(Source, "$1");

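Note the $1 at the end. This is the part that takes whatever group 1 captured (the whole <pre>...</pre> block, tags included) and puts it back untouched. When the match is just a stray \n or \r, group 1 is empty, so the character is simply removed.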
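To make the effect of the replacement easier to see, here is a minimal, self-contained sketch that runs the same pattern over a small sample string. The class name `PreAwareStripDemo`, the method name `Strip`, and the sample input are made up for illustration; the pattern is written as a verbatim string so the regex engine, rather than the C# compiler, interprets `\n` and `\r`, which should behave the same as the line above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the original filter.
public static class PreAwareStripDemo
{
    // Same pattern as in the answer: a balanced <pre>...</pre> block
    // (captured as group 1) OR a single line-break character.
    private static readonly Regex StripNewlines = new Regex(
        @"(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]",
        RegexOptions.Compiled);

    public static string Strip(string html)
    {
        // "$1" re-emits the <pre> block when that alternative matched;
        // for a lone \r or \n, group 1 is empty, so the character is dropped.
        return StripNewlines.Replace(html, "$1");
    }

    public static void Main()
    {
        // Sample input assumed for illustration only.
        string input = "<div>\r\n<pre>line 1\r\nline 2</pre>\r\n</div>";
        string expected = "<div><pre>line 1\r\nline 2</pre></div>";

        // Line breaks outside <pre> are removed, those inside are kept.
        Console.WriteLine(Strip(input) == expected);
    }
}
```

Keep in mind that the pattern only recognises a bare `<pre>` opening tag; a tag with attributes, such as `<pre class="note">`, would not be protected without adjusting it.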

Then after that write another line to replace \s\s+ with a single space. I think that should work pretty well.
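One caveat with that second step: a plain `\s\s+` replacement runs over the whole string, so it would also collapse the `\r\n` pairs and indentation that the first step just preserved inside `<pre>`. A possible way around this, sketched below as a variation rather than as part of the answer above, is to reuse the same `<pre>` alternative and decide per match with a `MatchEvaluator`. The names `PreAwareCollapseDemo` and `Collapse` and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; a variation on the answer's approach.
public static class PreAwareCollapseDemo
{
    // First alternative: the balanced <pre> pattern from the answer (group 1).
    // Second alternative: any run of two or more whitespace characters.
    private static readonly Regex PreOrWhitespaceRun = new Regex(
        @"(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|\s\s+",
        RegexOptions.Compiled);

    public static string Collapse(string html)
    {
        // If group 1 took part in the match we are looking at a <pre> block,
        // so return it unchanged; otherwise replace the whitespace run
        // with a single space.
        return PreOrWhitespaceRun.Replace(html,
            m => m.Groups[1].Success ? m.Value : " ");
    }

    public static void Main()
    {
        // Sample input assumed for illustration only.
        string input = "<p>a   b</p><pre>x\r\n  y</pre>";
        string expected = "<p>a b</p><pre>x\r\n  y</pre>";

        // Whitespace outside <pre> is collapsed, the <pre> content is kept.
        Console.WriteLine(Collapse(input) == expected);
    }
}
```

The same evaluator idea could fold the line-break removal into this pass as well, at the cost of a slightly busier callback.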

Steve Wortham
Holy awesomeness, you are awesome! I really have no idea what that says except bits and pieces here and there, but you essentially just took 5 of my Regexes and put them in one and did the `<pre>` fix. You're awesome! Thanks for the assist!
Alex
You're welcome. ;) Most people will tell you that regular expressions suck for parsing HTML because of its nested nature, and they can be right. But I just learned about the balanced matching feature in the .NET regex engine recently. Powerful stuff, that.
Steve Wortham