I am using Jinja2 to generate HTML files, which are typically very large. I noticed that the generated HTML contains a lot of whitespace. Is there a pure-Python tool that I can use to minimize this HTML? By "minimize", I mean remove unnecessary whitespace from the HTML (much like Google does; look at the source of google.com, for instance).

I don't want to rely on external libraries or executables such as tidy for this.

For further clarification, there is virtually no JavaScript code. Only HTML content.

+2  A: 

If you just want to get rid of excess whitespace, you can use:

>>> import re
>>> html_string = re.sub(r'\s\s+', ' ', html_string)

or:

>>> html_string = ' '.join(html_string.split())

If you want to do something more complicated than just stripping excess whitespace, you'll need to use more powerful tools (or more complex regexps).
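As a rough illustration of "something more complicated", here is a minimal sketch that collapses whitespace everywhere except inside elements where it is significant. The function name `collapse_whitespace` and the list of preserved tags are assumptions made for this example. It is still regex-based rather than a real HTML parser, so it can be fooled by unusual markup, and it will still collapse runs of spaces inside attribute values.

```python
import re

# Elements whose inner whitespace is significant and should be left alone.
# The tag list is an assumption for this sketch; extend it as needed.
_PRESERVE = re.compile(
    r'<(?P<tag>pre|textarea|script|style)\b[^>]*>.*?</(?P=tag)\s*>',
    re.DOTALL | re.IGNORECASE,
)

def collapse_whitespace(html):
    """Collapse runs of whitespace to one space outside preserved elements."""
    out = []
    pos = 0
    for match in _PRESERVE.finditer(html):
        # Collapse the ordinary markup that precedes the preserved element.
        out.append(re.sub(r'\s+', ' ', html[pos:match.start()]))
        # Copy the preserved element through unchanged.
        out.append(match.group(0))
        pos = match.end()
    # Collapse whatever follows the last preserved element.
    out.append(re.sub(r'\s+', ' ', html[pos:]))
    return ''.join(out)

sample = "<p>Hello    world</p>\n\n  <pre>a\n   b</pre>\n  <p>x</p>"
print(collapse_whitespace(sample))
```

Note that this only reduces runs of whitespace to a single space; it never removes whitespace entirely, because a space between inline elements can affect rendering.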

Edward Loper
This will also strip spaces between words in text, such as in paragraphs and tag attributes, and make the HTML invalid.
Evgeny
+2  A: 

You might also investigate Jinja's built-in whitespace control, which might alleviate some of the need for manually removing whitespace after your templates have been rendered.

Quoting the docs:

But you can also strip whitespace in templates by hand. If you put a minus sign (-) at the start or end of a block (for example, a for tag), a comment, or a variable expression, you can remove the whitespace after or before that block:

{% for item in seq -%}
    {{ item }}
{%- endfor %}

This will yield all elements without whitespace between them. If seq was a list of numbers from 1 to 9 the output would be 123456789.
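To check this behaviour yourself, the quoted template can be rendered from Python. This is a small sketch that assumes Jinja2 is installed; the variable names are made up for the example, and it also renders the same template without the minus signs for comparison.

```python
from jinja2 import Template

# The template quoted from the docs, with whitespace control markers.
stripped = Template(
    "{% for item in seq -%}\n"
    "    {{ item }}\n"
    "{%- endfor %}"
)

# The same template without the minus signs, for comparison.
plain = Template(
    "{% for item in seq %}\n"
    "    {{ item }}\n"
    "{% endfor %}"
)

rendered = stripped.render(seq=range(1, 10))
rendered_plain = plain.render(seq=range(1, 10))

print(repr(rendered))        # no whitespace between the items
print(repr(rendered_plain))  # newlines and indentation are kept
```

The second template keeps the newlines and indentation around each item, which is where much of the extra whitespace in rendered pages comes from.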

Will McCutchen
But this only handles whitespace between blocks -- not within blocks, or in non-block content (such as hand-written paragraphs).
Sridhar Ratnakumar