views: 181
answers: 9

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?

If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?

Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.

Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

A: 

If I understand your question correctly, what you need is gzip compression, which is readily available with Apache.

Sabeen Malik
+1: Gzip is optimized for text content, and HTML is often just simple ASCII. There are modules for Apache that can gzip on the fly.
dbemerlin
A: 

Gzip is commonly used to compress webpages (e.g. HTML), but some versions of IE don't support it.

Wikipedia article

erenon
+1  A: 

About the only "lossy" compression I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.

You can also flatten large JavaScript libraries using the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.
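
As an illustration only (a rough sketch, not anyone's production pipeline), whitespace flattening can be approximated with a couple of regular expressions; a real flattener must also leave <pre>, <textarea> and inline <script>/<style> content alone:

    import re

    def flatten_whitespace(html: str) -> str:
        # Crude whitespace flattening: drop whitespace between adjacent tags,
        # then collapse remaining runs of whitespace to a single space.
        # A real flattener must preserve <pre>, <textarea>, <script>, <style>.
        html = re.sub(r">\s+<", "><", html)
        html = re.sub(r"\s+", " ", html)
        return html.strip()

    print(flatten_whitespace("<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"))
    # -> <ul><li>one</li><li>two</li></ul>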

mrjoltcola
A: 

Run your code through an HTML minifier/obfuscator that removes as much markup as possible, then let your web server compress it with gzip.

Tronic
A: 

No, there are no HTML-specific compression algorithms, because the general-purpose ones have proved adequate.

The potential gains would come from knowing ahead of time the likely elements of an HTML page - you could start with a predefined dictionary that would not have to be part of the compressed stream. But this would not give a noticeable gain, as compression algorithms are extraordinarily good at picking out common sub-expressions on the fly.
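
A quick way to see just how good they are (a rough sketch with made-up sample data; exact figures will vary) is to compare a repetitive, HTML-like input against incompressible random bytes of the same length:

    import os
    import zlib

    # Repetitive, HTML-like input vs. random bytes of equal length.
    html_like = b"<tr><td class='cell'>value</td></tr>\n" * 500
    random_bytes = os.urandom(len(html_like))

    for label, data in (("html-like", html_like), ("random", random_bytes)):
        compressed = zlib.compress(data, 9)
        print(f"{label:10s} {len(data):6d} -> {len(compressed):6d} bytes")

    # The HTML-like input shrinks dramatically because DEFLATE spots the
    # repeated tag and attribute sub-expressions on its own; the random
    # input barely shrinks at all (it may even grow slightly).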

Mark Ransom
A: 

You would usually use a common algorithm like gzip, which is supported by most browsers through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking your website's browser support.

Additionally, you can minify static HTML files (or do that dynamically).
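
For readers not on Apache, the principle mod_deflate applies can be sketched in a few lines of Python (a simplified illustration of the content negotiation involved, ignoring q-values and Vary headers): only compress when the client advertises gzip support in Accept-Encoding.

    import gzip

    def maybe_gzip(body: bytes, accept_encoding: str):
        # Gzip the response only if the client advertises gzip support.
        # Real servers also honour q-values and emit 'Vary: Accept-Encoding'.
        if "gzip" in accept_encoding.lower():
            return gzip.compress(body), {"Content-Encoding": "gzip"}
        return body, {}

    page = b"<p>Hello, world!</p>" * 200
    compressed, headers = maybe_gzip(page, "gzip, deflate")
    print(len(page), "->", len(compressed), headers)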

AndiDog
Why the downvote?
AndiDog
+1  A: 

Is gzip compression not sufficient for your needs? It gives you about a 10:1 compression ratio, not only on HTML content but also on JavaScript, CSS, etc., and is readily available on most servers and reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule, etc.) as well as in all modern browsers (you can instruct both Apache and Nginx to skip compression for specific browsers based on User-Agent.)

You'll be surprised how close gzip compression comes to optimal. Some people have suggested minifying your files; however, minification achieves most of its gains from a technique similar to (yet more limited than) DEFLATE. So expect a minified file to be larger or much larger than a gzipped original (particularly with HTML, where you are stuck with the W3C's tags and attributes and only gzip can help you there), and expect gzipping a minified file to yield minimal gain over gzipping the original file. The exception is a file containing lots of comments, which a minifier can discard completely -- i.e. what you probably referred to as "lossy" -- but that is something you probably don't want to do with HTML anyway, not unless you're sure that none of your <script> or <style> tags are inside HTML comments <!-- --> placed there to accommodate antediluvian browsers.
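
A rough way to check this on your own content (a sketch with synthetic sample HTML and a crude whitespace collapse standing in for a real minifier; actual numbers depend entirely on the document):

    import gzip
    import re

    html = (b"<div class='row'>\n    <span class='label'>Name</span>\n"
            b"    <span class='value'>Example</span>\n</div>\n") * 300

    # Crude stand-in for a minifier: drop whitespace between adjacent tags.
    minified = re.sub(rb">\s+<", b"><", html)

    for label, data in (("original", html),
                        ("minified", minified),
                        ("gzip(original)", gzip.compress(html)),
                        ("gzip(minified)", gzip.compress(minified))):
        print(f"{label:15s} {len(data):7d} bytes")

    # Typically minification alone saves far less than gzip alone, and
    # gzip(minified) comes out only slightly smaller than gzip(original),
    # because DEFLATE already removes most of the redundancy a minifier targets.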

vladr
gzip is sufficient, but I'm a computer scientist -- I want optimal. :)
hmason
see my update to address the question of "optimal". :)
vladr
Minified gzip works very well too.
Justice
I think the question is more along the lines of "I know I can compress HTML by removing extraneous whitespace, what other compression techniques can I perform and still have valid HTML?"
Bryan Oakley
Not at the time the question (and the gzip answers) were originally posted. Moreover, I clearly make the case that non-gzip techniques are in most cases futile. You may not have read the post thoroughly: the "...what other techniques..." part, namely removing HTML comments, is also addressed.
vladr
+1  A: 

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.

Yet HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically, all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; gzip, often cited, uses such an LZ-based algorithm (LZ77, I think).

An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using citations from the very first instance of such a pattern. This gain would be particularly noticeable for smaller HTML documents.
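
As it happens, zlib already exposes this hook: a preset dictionary (zdict) that both ends agree on out of band, so it never travels with the compressed stream. A small sketch of the idea (the dictionary below is an illustrative guess, not a tuned one), with the gain most visible on short documents:

    import zlib

    # Illustrative preset dictionary of common HTML fragments -- a real one
    # would be built from a corpus of representative pages.
    HTML_DICT = (b'</div></span></li></ul></p></a>'
                 b'<div class="<span class="<a href="<ul><li><p>'
                 b'<!DOCTYPE html><html><head><meta charset="utf-8">'
                 b'<title></title></head><body></body></html>')

    doc = (b'<html><head><title>Hi</title></head>'
           b'<body><p><a href="/x">link</a></p></body></html>')

    def deflate(data, zdict=None):
        if zdict is None:
            c = zlib.compressobj(9)
        else:
            c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=zdict)
        return c.compress(data) + c.flush()

    plain = deflate(doc)
    primed = deflate(doc, HTML_DICT)
    print(len(doc), len(plain), len(primed))  # primed is typically the smallest

    # Decompression must be primed with the same dictionary:
    d = zlib.decompressobj(zdict=HTML_DICT)
    assert d.decompress(primed) + d.flush() == doc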

A complementary, similar idea is to have the compression and decompression methods imply (i.e. not send) the information needed by the other compression stage of an LZ-x algorithm (say, the Huffman tree in the case of LZH), using statistics specific to typical HTML, while being careful to exclude from the character counts the [statistically weighted] instances of characters encoded by citation. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than to that of the complete HTML text.


Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.

I may otherwise be looking in the wrong places; there doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).

mjv
A: 

Use S-expressions instead, saves you a number of characters per tag :)

leppie