What libraries / tools are available for tidying up XML?

I've found the highly recommended HtmlTidy, but unfortunately it doesn't handle my input XML files correctly. I mean to submit a bug report, but in the meantime I need an XML tidying tool that works with my XML.

Can anyone suggest any alternatives?

Update: By "Tidy" I mean prettify the xml, so (for example):

<xml><testing attribute="somevalue"><etc /></testing></xml>

Becomes

<xml>
  <testing attribute="somevalue">
    <etc />
  </testing>
</xml>

The bug I'm getting with HtmlTidy

I intend to submit a bug report once I can reproduce the problem with some XML that I am able to share. In the meantime, if you are interested, the errors I get look something like this:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 1 column 1 - Info: <head> previously mentioned
line 1 column 1 - Warning: inserting implicit <body>
line 1 column 6558 - Error: <myelement> is not recognized!
line 1 column 6558 - Warning: discarding unexpected <myelement>
** snip - around 15 similar errors / warnings **
48 warnings, 22 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

It's worth noting that my XML is reasonably large (~18k) and all formatted on a single line; however, it is completely valid XML. If I open the file in Visual Studio and use its built-in "prettifier", HtmlTidy is able to parse the resulting XML correctly.

A: 

Do you have xmllint (it ships with libxml2)? Its --format option reindents the XML and writes the nicely formatted result to standard output.

Matt Gibson
A: 

If you can use XSLT, then you already have a tool which can do this.

Create a stylesheet containing the identity transform, and set the indent attribute of xsl:output to "yes" to indent the output. Bingo -- tidy XML, by your definition.
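
For what it's worth, a minimal sketch of such a stylesheet might look like the following. The xsl:strip-space line is my own addition rather than part of the identity transform: it assumes you want existing whitespace-only text nodes discarded so that the indentation comes out regular, and you can drop it if that is not the case.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Ask the processor to add indentation whitespace when serialising. -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: discard whitespace-only text nodes from the input. -->
  <xsl:strip-space elements="*"/>

  <!-- Identity transform: copy every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

You would run this with whatever XSLT 1.0 processor you have to hand, for example xsltproc tidy.xsl input.xml, where both file names are placeholders. How far each level is indented is left to the processor.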

Note, by the way, that the XML with added space is not in principle equivalent to the original (there are cases where whitespace is important to XML), but that probably doesn't matter to you.
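
To illustrate the whitespace caveat, here is a made-up fragment of mixed content (the element names are purely illustrative). The single space between the two child elements is a real text node, so a tool that treats it as ignorable, or that adds indentation around it, changes what the element contains.

```xml
<!-- Before: the text content of p is "Hello world". -->
<p><b>Hello</b> <i>world</i></p>

<!-- After indenting: p now also contains newlines and spaces,
     and the original separating space may have been dropped. -->
<p>
  <b>Hello</b>
  <i>world</i>
</p>
```

For data-style XML such as the example in the question, where elements contain either child elements or text but not both, this is unlikely to bite.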

Norman Gray