views:

535

answers:

4

Hi,

What's the best way to strip out all unnecessary whitespace from a .NET website? I found this site: Whitespace removal - 4Wall Art Site

If you look at the source, it's clearly a .NET site, but all unwanted tabs and spaces have been removed. Having searched around, it seems a regular expression applied on page render is the best method, but does anyone have any examples? Or any conflicting opinions on whether this is the best way? The HTML source on that site is down to ~30 KB, which is something I'm striving toward!

Thanks, Steve

+5  A: 

If you have not done so yet, you would do much better to turn on gzip/deflate compression in IIS. If you are trying to reduce network traffic and improve performance, compression has a larger effect than removing whitespace.
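To get a feel for the difference, here is a rough sketch in Python (not IIS itself) that compares the two approaches on a made-up page. The sample markup, `collapse_whitespace` and `gzip_size` are all hypothetical, and the page is deliberately repetitive, which flatters gzip; measure your own pages before drawing conclusions.

```python
import gzip
import re

# Hypothetical sample page: heavily indented markup repeated many times.
# Real pages are less repetitive, so real ratios will be less extreme.
row = "        <tr>\n            <td>  item  </td>\n        </tr>\n"
html = "<table>\n" + row * 500 + "</table>\n"


def collapse_whitespace(text: str) -> str:
    """Naively collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


raw_size = len(html.encode("utf-8"))
stripped_size = len(collapse_whitespace(html).encode("utf-8"))

print("raw:            ", raw_size)
print("stripped only:  ", stripped_size)
print("gzip only:      ", gzip_size(html))
print("stripped + gzip:", gzip_size(collapse_whitespace(html)))
```

Once compression is on, runs of spaces and tabs compress very well anyway, so stripping them first usually saves little extra.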

David Waters
+1 This is a better solution for the problem. I don't believe that regex is the right tool here.
Andrew Hare
Thanks David, I'll have a read through this
stibstibstib
Another good article on the subject is http://weblogs.asp.net/owscott/archive/2004/01/12/57916.aspx
David Waters
+1  A: 

You should almost never try to use a regex on HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). This is really a job for a parser (see "What is the best way to parse html in C#?" for HTML parsers for C#). The pseudocode for what you want to do for each element is:

print tag and attributes with minimal spaces
if tag is in list of tags whose contents can be modified
    strip redundant whitespace from contents
print contents
print end tag

One example of a tag that should not have its contents modified is the pre tag.
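As an illustration of the pseudocode above, here is a minimal sketch using Python's standard `html.parser` rather than a C# parser. It inverts the test (it keeps a list of tags to leave alone instead of a list of tags that may be modified), and the names `WhitespaceMinifier`, `PRESERVE` and `minify` are my own, not from any library. The `PRESERVE` list is an assumption and probably incomplete; treat this as a starting point, not a production minifier.

```python
import html
import re
from html.parser import HTMLParser

# Assumption: tags whose contents must be emitted untouched.
PRESERVE = {"pre", "textarea", "script", "style"}


class WhitespaceMinifier(HTMLParser):
    """Re-emits HTML, collapsing whitespace runs outside PRESERVE tags."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve_depth = 0

    @staticmethod
    def _format_tag(tag, attrs, closer=">"):
        # Print the tag and its attributes with single separating spaces.
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # bare attribute such as "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append(f'{name}="{html.escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs))
        if tag in PRESERVE:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs, closer="/>"))

    def handle_endtag(self, tag):
        if tag in PRESERVE and self.preserve_depth > 0:
            self.preserve_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only strip redundant whitespace when not inside a preserved tag.
        if self.preserve_depth == 0:
            data = re.sub(r"\s+", " ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def minify(markup: str) -> str:
    parser = WhitespaceMinifier()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = "<div   class='a'>\n   Hello    world\n  <pre>  keep\n   this </pre>\n</div>"
print(minify(sample))
```

Note that it collapses whitespace to a single space rather than removing it entirely, because a space between inline elements can be significant for rendering.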

Chas. Owens
A: 

Well, if you really want to, you could use sed from bash; a Perl regex will achieve the same thing:

Bash:

sed -E 's/ +/ /g' yourhtmlfile.html > newReducedFile.html

That should achieve what you want. It will collapse one or more consecutive spaces into a single space, which should remove most of the unnecessary whitespace from your file. For a .NET website you could use Perl or Python instead; both have Windows versions.
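For what it's worth, a rough Python equivalent of the sed command might look like the sketch below. The function name `collapse_spaces` is just for illustration. Like the sed version, it is a blunt tool: it only touches spaces (not tabs or newlines), and it will also squash spaces inside pre tags.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace each run of one or more spaces with a single space."""
    return re.sub(r" +", " ", text)


print(collapse_spaces("<p>some     text   here</p>"))  # <p>some text here</p>
```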

Robert Massaioli
A: 

If you really feel the need to remove whitespace, a place to start would be http://www.codeproject.com/KB/aspnet/WhitespaceFilter.aspx . I stress that this should only be a starting point: don't just copy the code in the article, as the author clearly did not have a good grasp of regular expressions, which the code uses heavily and very inefficiently.

However, it does show the technique of using a filter to modify the output of all pages.

David Waters