I have a few hand-crafted web pages. When deploying them I would like to run them through a tool so that new smaller HTML files are created, with extraneous whitespace taken out, etc.

We already use YUICompressor for our JavaScript and our CSS, and we tend to follow all of the techniques described by the Yahoo performance team.

Is there a good, free tool that does this? I prefer tools that would fit into our deployment process similarly to YUICompressor.

+1  A: 

HTML Tidy does the job.

I use the following on one document that I generate (a rather large one). This saved me about 10% on the post-gzip size.

tidy -c -omit -ashtml -utf8 --doctype strict \
    --drop-proprietary-attributes yes --output-bom no \
    --wrap 0  source.html > target.html
  • -c: replace surplus presentational tags and attributes
  • -omit: drop optional end tags
  • -ashtml: output HTML rather than XHTML (HTML is leaner, and XHTML provides no benefits for most use cases)
  • -utf8: use UTF-8, so characters outside ASCII don't have to be written as entities (entities take more bytes)
  • --doctype strict: use the Strict doctype (again, leaner)
  • --drop-proprietary-attributes yes: get rid of proprietary junk
  • --output-bom no: don't write a byte order mark (BOMs cause issues in some clients)
  • --wrap 0: disable line wrapping, so no extra line breaks are inserted (this produces very long lines)
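
To fit this into a deployment step in the way YUICompressor is usually wired in, the same command could be run over a whole directory. The following is only a sketch under assumptions: the names TIDY_OPTIONS, tidy_command and tidy_tree, and the source/output directory layout, are made up for illustration, and tidy is assumed to be on the PATH. The options themselves are exactly the ones listed above.

```python
import subprocess
from pathlib import Path

# Options copied verbatim from the command above; nothing new is added here.
TIDY_OPTIONS = [
    "-c", "-omit", "-ashtml", "-utf8",
    "--doctype", "strict",
    "--drop-proprietary-attributes", "yes",
    "--output-bom", "no",
    "--wrap", "0",
]


def tidy_command(source: Path) -> list[str]:
    """Build the argument list for one file; tidy writes its result to stdout."""
    return ["tidy", *TIDY_OPTIONS, str(source)]


def tidy_tree(src_dir: Path, out_dir: Path) -> None:
    """Run tidy on every .html file under src_dir, mirroring paths in out_dir.

    Hypothetical helper: the directory layout is an assumption.
    """
    for source in sorted(src_dir.rglob("*.html")):
        target = out_dir / source.relative_to(src_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(tidy_command(source), capture_output=True)
        # tidy exits with 0 (no problems), 1 (warnings) or 2 (errors).
        if result.returncode > 1:
            message = result.stderr.decode(errors="replace")
            raise RuntimeError(f"tidy failed on {source}: {message}")
        target.write_bytes(result.stdout)
```

Calling tidy_tree(Path("site"), Path("build")) would then write the tidied copies under build/, leaving the hand-crafted originals untouched.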
David Dorward
Thanks a lot. I will look into this. For my benefit (and possibly that of some of the readers), can you explain the meaning of the various options you pass? Thanks again.
Markus V.
+1  A: 

Plain old minify will also attack your HTML for you, if you want.

But HTML minification isn't, generally, hugely effective:

  • Taking runs of whitespace down to one won't do that much. If you're already using gzip/deflate, that'll be compressing the whitespace quite efficiently. You can't remove all whitespace, because a single whitespace character often has an effect on rendering that you want to keep.

  • Taking comments out may have an effect, depending on how much comment content you actually have. But you'd have to be careful not to hit conditional comments.

  • Apart from that, there is not much in an HTML document that can be ‘minified’. Obviously the JS idea of packing variable names down to the shortest possible string is inapplicable.

  • Doing all this with regex, as most minifiers do, is a bit dodgy. You have to stick to a limited ‘normal’ range of markup that won't trip it up.
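
To make the points above concrete, here is a rough sketch of the regex approach, not the code of minify or any other real tool. The function names naive_minify and gzip_size are made up for illustration. It strips ordinary comments, leaves conditional comments alone, and collapses whitespace runs to a single space. It shares the weakness described above: it will happily mangle whitespace inside pre, textarea or script content, so it only suits a limited range of markup.

```python
import gzip
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
WHITESPACE_RUN = re.compile(r"\s+")


def _drop_unless_conditional(match: re.Match) -> str:
    """Keep conditional comments such as <!--[if IE]> ... <![endif]-->."""
    body = match.group(1).lstrip()
    if body.startswith("[if") or body.startswith("<![endif]"):
        return match.group(0)
    return ""


def naive_minify(html: str) -> str:
    """Remove ordinary comments, then collapse each whitespace run to one space."""
    without_comments = COMMENT.sub(_drop_unless_conditional, html)
    return WHITESPACE_RUN.sub(" ", without_comments).strip()


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))
```

Comparing gzip_size(page) with gzip_size(naive_minify(page)) on your own pages is a quick way to see how much is left to gain once gzip is already in place.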

With HTML minification you're typically getting less gain (and less post-gzip gain) than JS/CSS minification, and for dynamically-generated pages you have more overhead (as you can't pre-minify them like with static scripts/styles). Some templating languages may already have built-in features for trimming whitespace at generation time; if available in your environment, use that.

bobince
1. Having looked at minify, it does not look suitable for the job. It's more of a tool for wiring together CSS and JavaScript files. 2. As I said, our HTML files are hand-crafted and rich in comments and spaces. They look a bit like this: <html>...<body> <!-- here is the content... --> <div id='content'> </div> Surely we can gain a lot just by reducing their size at deployment.
Markus V.