I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for added line breaks and appropriate indentation. (All tags, markup, and content should be untouched.)

The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.

First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)

+1  A: 

If the HTML is indeed well-formed XML, you can use a DOM parser.

from xml.dom.minidom import parse, parseString

# if you have the HTML string in a variable
html = parseString(theHtmlString)

# or parse the HTML file
html = parse(htmlFileName)

print(html.toprettyxml())

The toprettyxml() method lets you specify the indent, the newline character, and the encoding of the output. You might also want to check out the writexml() method.
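
As a rough illustration of those options, here is a minimal sketch using only the standard library on Python 3. The sample markup and the variable names (doc, pretty_text, pretty_bytes, buffer) are made up for the example. Note that calling toprettyxml() on a whole Document also emits an XML declaration, which may not be wanted for HTML.

```python
import io
from xml.dom.minidom import parseString

# Sample input (an assumption for this sketch): well-formed markup on one line.
doc = parseString(
    "<html><head><title>Title</title></head><body><p>Hi</p></body></html>"
)

# indent and newl control the inserted whitespace; the result is a str.
pretty_text = doc.toprettyxml(indent="  ", newl="\n")

# When an encoding is given, the result is bytes instead of str.
pretty_bytes = doc.toprettyxml(indent="  ", encoding="utf-8")

# writexml() writes to any file-like object. Calling it on the root element
# rather than on the document skips the XML declaration.
buffer = io.StringIO()
doc.documentElement.writexml(buffer, indent="", addindent="  ", newl="\n")

print(pretty_text)
print(buffer.getvalue())
```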

Amarghosh
Thanks to you, I put together a very workable solution before seeing J.F. Sebastian's nicer example above. (I'm submitting it as another answer just so anyone else not wanting to install Beautiful Soup will have another option.) Thanks for the lead!
peppergrower
+2  A: 

Algorithm

  1. Parse html into some representation
  2. Serialize the representation back to html
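
For anyone who wants to see the two steps without installing anything, here is a minimal sketch built on the standard library's html.parser (Python 3). It is not html5lib and does no error correction; the names IndentingSerializer, VOID_TAGS, and indent_html are invented for this illustration. It re-emits each start tag as written, re-escapes text, and does not treat pre, script, or style content specially, so take it as a starting point only.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not increase the depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class IndentingSerializer(HTMLParser):
    """Parses markup (step 1) and collects one indented line per token (step 2)."""

    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.depth = 0
        self.lines = []

    def _emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it was written.
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit it, depth is unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth = max(self.depth - 1, 0)
        self._emit("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape again.
        text = data.strip()
        if text:
            self._emit(escape(text, quote=False))

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def indent_html(markup, indent="  "):
    serializer = IndentingSerializer(indent)
    serializer.feed(markup)
    serializer.close()  # flush any text still buffered at end of input
    return "\n".join(serializer.lines)


print(indent_html("<div><p>Hi</p><br></div>"))
```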

Example html5lib parser with BeautifulSoup tree builder

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""

soup = parser.parse(c)
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
J.F. Sebastian
That looks great, thanks!
peppergrower
+1  A: 

I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:

from html5lib import HTMLParser, treebuilders
try:
  from cStringIO import StringIO  # Python 2
except ImportError:
  from io import StringIO  # Python 3

def tidy_html(text):
  """Returns a well-formatted version of input HTML."""

  p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(text)

  # using cStringIO for fast string concatenation
  pretty_HTML = StringIO()

  node = dom_tree.firstChild
  while node:
    node_contents = node.toprettyxml(indent='  ')
    pretty_HTML.write(node_contents)
    node = node.nextSibling

  output = pretty_HTML.getvalue()
  pretty_HTML.close()
  return output

And an example:

>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print(tidy_html(text))
<b>
  <i>
    bold, italic
  </i>
</b>
<div>
  a div
</div>

Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this, I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, a DocumentFragment doesn't have a writexml() method (which toprettyxml() calls), so I iterate over its child nodes, which do have the method.
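
The same idea can be tried without html5lib. Below is a minimal sketch (Python 3, standard library only) that builds a DocumentFragment by hand and pretty-prints it child by child. The helper name pretty_fragment and the sample nodes are made up for the illustration, and the exact whitespace around text nodes may differ between Python versions.

```python
from xml.dom.minidom import Document


def pretty_fragment(fragment, indent="  "):
    """Pretty-print each child of a DocumentFragment and join the results."""
    # The children are ordinary nodes, so toprettyxml() can be called on each.
    return "".join(child.toprettyxml(indent=indent) for child in fragment.childNodes)


# Build a small fragment by hand (hypothetical sample data).
doc = Document()
fragment = doc.createDocumentFragment()

div = doc.createElement("div")
paragraph = doc.createElement("p")
paragraph.appendChild(doc.createTextNode("a div"))
div.appendChild(paragraph)
fragment.appendChild(div)

result = pretty_fragment(fragment)
print(result)
```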

peppergrower