Hi all.

Minifying HTML is the only section of Google's Page Speed where there is still room for improvement.

My site is fully dynamic and the HTML is already served with Deflate compression, so I don't want to put any more pressure on the server by minifying pages in real time before sending them.

What I could do is minify the template files. My template files are a mix of PHP and HTML, so I've come up with some code that I think is pretty safe, but I would like the community to review it.

// this will loop through all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);

This turns each template file into a single very long line, interrupted only where DB content is inserted. Google's homepage source looks more or less like what I get, so I wonder if they follow a similar approach.

Question 1: Do you anticipate potential problems?
Question 2: Is there a better (more efficient) way to do this?

And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).

Edit: Do take into consideration that the template files will be minified on deploy. Just as the CSS and JavaScript files are minified and compressed using YUI Compressor and Closure, the template files would likewise be minified on deploy, not on client request.

Thank you.

+1  A: 

White space can be significant (e.g. in pre elements).
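As a rough sketch of how that could be handled with the regex approach from the question, the whitespace-sensitive elements can be matched first and passed through untouched. The function name is made up for illustration, and it assumes these elements are not nested and that each one opens and closes inside the same template file:

```php
<?php
// Hypothetical helper (not from the question): collapse runs of whitespace,
// but leave <pre>, <textarea>, <script> and <style> blocks exactly as they are.
// Assumes these elements are not nested and are closed in the same file.
function collapse_whitespace_outside_pre(string $html): string
{
    // First alternative: a whole protected element, matched lazily.
    // Second alternative: any run of whitespace elsewhere.
    $pattern = '#<(pre|textarea|script|style)\b[^>]*>.*?</\1\s*>|\s+#is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is only set when a protected element matched.
            return !empty($m[1]) ? $m[0] : ' ';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error (for example a
    // backtrack limit on a huge block); fall back to the unmodified input.
    return $result ?? $html;
}
```

This would be applied to the output of php_strip_whitespace() in place of the plain preg_replace('/\s+/', ' ', ...) call. Keeping script blocks intact also avoids a single-line // comment swallowing the code that follows it.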

When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.

tidy -c -n -omit -ashtml -utf8 --doctype strict \
    --drop-proprietary-attributes yes --output-bom no \
    --wrap 0
David Dorward
@David thanks for your input. I tried Tidy but it seemed overkill because it expects a full HTML page and I'm only processing the templates. Example: the <html> tag opened in the `header template` is only closed in the `footer template`. Tidy seems better suited to run on the **final html**. That won't work in this case because, as I said, all pages on the site are dynamic.
Frankie
Apart from pre, if you have some scripting inside the HTML, single-line comments would also create a problem there.
Ravindra Sane
@Ravindra, nice one, thks! Actually all the scripts come from external javascript files so in that scenario I would be ok. Makes me wonder if there are any more potential problems...
Frankie
+1  A: 

I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.

GSto
@GSto do remember that this is done prior to any request. The templates would be minimized **before** sending the files to production (I should have been more clear about that).
Frankie
A: 

I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It only affects template content, so there is little risk of breaking an unexpected <pre> or similar element.

It is run before deploy so there is no impact on the server; if anything, there should be a small speed-up as the files become smaller.

Do remember that content coming from the database is not affected at all since, as I said before, this runs before deploy and on template files only.

The method seems solid enough to put into production.
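For reference, the deploy step could look roughly like the sketch below. It wraps the three calls from the question in a loop; the function name, the two directory arguments and the *.php glob pattern are placeholders chosen for illustration, not the exact script I run:

```php
<?php
// Deploy-time sketch. minify_templates() and both directory arguments are
// hypothetical names; adjust the glob pattern to your template extension.
function minify_templates(string $srcDir, string $outDir): array
{
    $saved = [];
    if (!is_dir($outDir)) {
        mkdir($outDir, 0777, true);
    }

    foreach (glob($srcDir . '/*.php') ?: [] as $path) {
        $original = file_get_contents($path);

        // Strip PHP comments first, so a "//" comment cannot swallow the
        // rest of the file once the newlines are gone.
        $phpClean = php_strip_whitespace($path);

        // Collapse every run of whitespace into a single space.
        $minimized = preg_replace('/\s+/', ' ', $phpClean);

        $name = basename($path);
        file_put_contents($outDir . '/' . $name, $minimized);

        // Bytes saved per template, useful for a deploy log.
        $saved[$name] = strlen($original) - strlen($minimized);
    }

    return $saved;
}
```

One thing to keep in mind: the regex also collapses whitespace inside PHP string literals and heredocs, so templates that rely on exact spacing in strings should be excluded or checked by hand.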

If anything goes wrong I'll post it here.

Frankie
+1  A: 

Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.

In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
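A conservative sketch of how this could be automated at deploy time follows. The function name is hypothetical, and it only covers a small subset of the "- O" elements: an end tag is removed only when the next thing in the same file makes the omission unambiguous (for example </li> directly followed by <li> or </ul>), because dropping </p> in front of loose text would pull that text into the paragraph.

```php
<?php
// Hypothetical helper: drop a few optional end tags, but only where the
// following markup (visible in the same template file) closes the element
// implicitly. Anything ambiguous is left alone.
function drop_optional_end_tags(string $html): string
{
    $rules = [
        // </li> followed by another <li> or by the end of the list.
        '#</li\s*>(?=\s*(?:<li\b|</[uo]l\b))#i',
        // </td> or </th> followed by another cell or by the end of the row.
        '#</t[dh]\s*>(?=\s*(?:<t[dh]\b|</tr\b))#i',
        // </tr> followed by another row or by the end of the table section.
        '#</tr\s*>(?=\s*(?:<tr\b|</(?:table|tbody|thead|tfoot)\b))#i',
        // </p> followed directly by another paragraph.
        '#</p\s*>(?=\s*<p\b)#i',
    ];

    // preg_replace() applies the patterns one after another, in order,
    // so the cell rule still sees </tr> before the row rule removes it.
    return preg_replace($rules, '', $html) ?? $html;
}
```

Because header and footer live in separate template files, a tag at the very end of a file has no visible follower and is simply kept.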

You can also omit the quotes around attribute values (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) for attributes such as id, class (with a single class name), and type when the value is simple (i.e., it matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can omit the value (e.g., write simply checked rather than checked=checked).
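A minimal sketch of both tips, using the same character class as above. The function name and the list of boolean attributes are assumptions made for illustration. It only touches tags that contain no embedded PHP, and it skips a value that sits directly before "/>" since an unquoted value would absorb the slash:

```php
<?php
// Hypothetical helper: unquote simple attribute values and shorten boolean
// attributes. Tags containing "<" or ">" inside an attribute (for example an
// embedded PHP block) do not match the tag pattern and are left unchanged.
function shorten_attributes(string $html): string
{
    $result = preg_replace_callback(
        '/<[a-zA-Z][^<>]*>/',
        function (array $m): string {
            $tag = $m[0];

            // checked="checked" -> checked (assumed list of boolean attributes).
            $tag = preg_replace(
                '/(?<=\s)(checked|selected|disabled|readonly|multiple)=(["\']?)\1\2(?=[\s>])/i',
                '$1',
                $tag
            );

            // name="simple" -> name=simple, only for values made of
            // letters, digits, hyphens, periods, underscores and colons.
            $tag = preg_replace(
                '/(?<=\s)([a-zA-Z][-a-zA-Z0-9_:]*)=(["\'])([-A-Za-z0-9._:]+)\2(?=[\s>])/',
                '$1=$3',
                $tag
            );

            return $tag;
        },
        $html
    );

    return $result ?? $html;
}
```

For example, `<input type="checkbox" id="a1" class="big red" checked="checked">` becomes `<input type=checkbox id=a1 class="big red" checked>`; the class value keeps its quotes because it contains a space.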

Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.

That being said, outside of a couple of large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.

yonran
@user471341 thks! This is indeed a very nice comment!
Frankie