ansaurus

Question

Adding consistent whitespace to HTML using Python

Answer 1

+1 A:

If the html is indeed well formed xml, you can use DOM parser.

from xml.dom.minidom import parse, parseString

#if you have html string in a variable
html = parseString(theHtmlString)

#or parse the html file
html = parse(htmlFileName)

print html.toprettyxml()

The toprettyxml() method lets to specify the indent, new-line character and the encoding of the output. You might want to check out the writexml() method also.

Amarghosh 2010-02-17 09:16:42

Thanks to you, I put together a very workable solution before seeing J.F. Sebastian's nicer example above. (I'm submitting it as another answer just so anyone else not wanting to install Beautiful Soup will have another option.) Thanks for the lead!

peppergrower 2010-02-17 22:11:08

Answer 2

+2 A:

Algorithm

Parse html into some representation
Serialize the representation back to html

Example html5lib parser with BeautifulSoup tree builder

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""

soup = parser.parse(c)
print soup.prettify()

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>

J.F. Sebastian 2010-02-17 13:05:33

That looks great, thanks!

peppergrower 2010-02-17 22:09:50

Answer 3

+1 A:

I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:

from html5lib import HTMLParser, treebuilders
from cStringIO import StringIO

def tidy_html(text):
  """Returns a well-formatted version of input HTML."""

  p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(text)

  # using cStringIO for fast string concatenation
  pretty_HTML = StringIO()

  node = dom_tree.firstChild
  while node:
    node_contents = node.toprettyxml(indent='  ')
    pretty_HTML.write(node_contents)
    node = node.nextSibling

  output = pretty_HTML.getvalue()
  pretty_HTML.close()
  return output

And an example:

>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> tidy_html(text)
<b>
  <i>
    bold, italic
  </i>
</b>
<div>
  a div
</div>

Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.

peppergrower 2010-02-17 22:39:05

ansaurus

tags:

views:

answers:

Adding consistent whitespace to HTML using Python

Algorithm

Example html5lib parser with BeautifulSoup tree builder

related questions