I'm looking to write an algorithm to compress the HTML output of a CMS I'm writing in PHP with the CodeIgniter framework.

I was thinking of removing the whitespace between tags everywhere except inside <script>, <pre>, and <style> elements, which I would simply skip for simplicity. To clarify, I mean whitespace between consecutive tags, with no text between them.

How should I go about parsing the HTML to find the whitespace I want to remove?

Edit: To start off, I want to remove all tab characters that are not in <pre> tags. This can be done with regex, I'm sure, but what are the alternatives?

+4  A: 

Is there something wrong with the existing HTML minification solutions?

Minify does HTML (as well as CSS and JS).

(That second link goes to the source code, which comments the steps it takes - should be a good leg up if you did want to create your own - it's BSD licensed.)

Also, as Pete says, you'll benefit much more from using gzip compression for your HTML (and CSS/JS/etc.), and you won't get tripped up by problems such as the one Gordon mentioned in his comment.

Peter Boughton
I'd forgotten that Minify worked with HTML. I already use it for JS and CSS minification. Also, I'm curious whether it can be done without so many regular expressions. I was under the impression that regex was rather inefficient.
timw4mail
The problem with regex is less about efficiency and more that HTML is not a regular language, so it cannot be parsed correctly with regex. You could investigate PHP's HTML DOM parsing (http://php.net/manual/en/book.dom.php) and consider writing something that parses the page with it and then outputs it again without the whitespace.
Peter Boughton
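For what it's worth, here is a minimal sketch of the DOM approach suggested in the comment above. The function name `strip_inter_tag_whitespace` and the decision to also protect `<textarea>` are my own assumptions, not something from Minify or from the thread. It only drops text nodes that are entirely whitespace, which is the "whitespace between consecutive tags" case from the question.

```php
<?php
// Sketch only: remove whitespace-only text nodes, leaving the contents of
// <pre>, <script>, <style>, and <textarea> untouched.
// strip_inter_tag_whitespace() is a hypothetical helper name.
function strip_inter_tag_whitespace(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Text nodes that contain nothing but whitespace and have no protected ancestor.
    $query = '//text()[normalize-space(.) = ""]'
           . '[not(ancestor::pre or ancestor::script'
           . ' or ancestor::style or ancestor::textarea)]';

    // Copy the result into an array so removals cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}
```

Some caveats: `loadHTML()` wraps fragments in `<html>`/`<body>` and adds a doctype, it assumes ISO-8859-1 unless the markup declares a charset, and removing the space between two inline elements (for example `</b> <i>`) changes how the page renders, so treat this as a starting point rather than a drop-in minifier.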
+7  A: 

Don't. The savings from stripping whitespace are negligible. You're better off using output compression, with zlib for example.

Pete
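To put a rough number on that claim, here is a small sketch that compares the two approaches on a made-up page. The sample markup and the crude `preg_replace` "minifier" are assumptions for illustration only, and it needs PHP's zlib extension for `gzencode()`.

```php
<?php
// Hypothetical sample page: heavily indented list markup, as a template might emit.
$html = str_repeat("\t\t<li>\n\t\t\t<a href=\"#\">item</a>\n\t\t</li>\n", 200);

// Crude stand-in for a minifier: drop whitespace between adjacent tags.
$minified = preg_replace('/>\s+</', '><', $html);

// Compressed sizes of both variants.
$gzRaw = strlen(gzencode($html, 6));
$gzMin = strlen(gzencode($minified, 6));

printf(
    "raw: %d, minified: %d, gzip(raw): %d, gzip(minified): %d\n",
    strlen($html),
    strlen($minified),
    $gzRaw,
    $gzMin
);

// In a real page you would normally compress the response itself, for example
// by calling ob_start('ob_gzhandler') before any output is sent.
```

Real pages are far less repetitive than this sample, so the ratios will be less dramatic, but the comparison is easy to rerun against your own output.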
I already output with compression, and am curious to see if I can minify HTML successfully.
timw4mail
The savings don't justify the effort, and it will potentially cause problems at some point, such as the one Gordon mentioned.
Pete
I realize it probably isn't worth the effort, but I was trying to do it for the challenge.
timw4mail