I've got a large-ish (90 MB) XML file from Excel, saved in XML Spreadsheet 2003 format. It contains some invalid data, so Firefox spits out messages like this:

Line Number 790402, Column 65:
<Cell ss:StyleID="s18"><Data ss:Type="String">Here's some data I&#5;?Bnternational</Data></Cell>

Is there a tool that'll parse my XML and tell me what's wrong with it, in a similar way to Firefox? Firefox is quite slow at parsing it (presumably because it's keeping it all in memory, ready to render into a nice navigable tree). I'm not bothered about validation against an XSD; I just want to know whether the XML is well-formed.

+4  A: 

There's a Linux command called xmllint (part of libxml2) that is good for this. It's very fast, handles honking great files without barfing, and gives useful validation error messages.
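
To tie this in with the Python snippets elsewhere in this thread, here is a minimal sketch of calling xmllint from a script. It assumes xmllint is installed and on the PATH; the helper names `xmllint_command` and `check_with_xmllint` are made up for illustration. The `--noout` flag suppresses the re-serialised document, and `--stream` asks xmllint to use its streaming API, which should help with a 90 MB file.

```python
import shutil
import subprocess


def xmllint_command(path):
    """Build the xmllint command line (hypothetical helper for this sketch)."""
    # --noout: do not print the parsed document, only report errors.
    # --stream: use the streaming API instead of building the whole tree.
    return ["xmllint", "--noout", "--stream", path]


def check_with_xmllint(path):
    """Return (ok, messages), or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(
        xmllint_command(path),
        capture_output=True,
        text=True,
        errors="replace",  # error excerpts may contain odd bytes
    )
    # xmllint exits with a non-zero status when the document fails to parse
    # and writes its diagnostics to stderr.
    return result.returncode == 0, result.stderr
```

On the command line, the equivalent is simply the list from `xmllint_command` joined with spaces.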

skaffman
Cool stuff. Even validation supported... How could I ever live without it? +1
Boldewyn
The --format option is also very handy
skaffman
+1  A: 

You could use features of other languages for that. E.g., a two-liner in Python:

import xml.dom.minidom as dom
# Raises xml.parsers.expat.ExpatError (with line and column) if the file is not well-formed
dom.parse('test.xml')

This will raise an exception that points at the first problem, and is quite performant. I remember there was an XML toolkit that worked quite well from within bash, but I can't find a link to that right now.

Cheers,

Edit: An answer to a related question suggested using SAX rather than DOM, since SAX streams through the document instead of building the whole tree in memory, so it'd be more performant. A ready-to-use Python script would then look something like this:

#!/usr/bin/env python
import xml.sax as sax
parser = sax.make_parser()
# Pass the file name so the parser opens the file itself, in binary mode
parser.parse('test.xml')
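
If you'd rather get a Firefox-style "line, column, message" report than a traceback, the same SAX approach can be wrapped like this. It's only a sketch: the function name `first_xml_error` is made up, it reports just the first fatal error, and the column numbering may not match Firefox's exactly.

```python
import xml.sax


def first_xml_error(path):
    """Return None if the file is well-formed, else (line, column, message)."""
    parser = xml.sax.make_parser()
    # The base ContentHandler ignores every event, so no tree is built
    # and memory use stays flat even for very large files.
    parser.setContentHandler(xml.sax.ContentHandler())
    try:
        # Open in binary mode so the parser handles the encoding itself.
        with open(path, "rb") as stream:
            parser.parse(stream)
    except xml.sax.SAXParseException as err:
        return err.getLineNumber(), err.getColumnNumber(), err.getMessage()
    return None
```

Calling `first_xml_error('test.xml')` and printing the result should be enough to locate something like the `&#5;` reference from the question, since the parser stops at the first construct that isn't well-formed.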

Edit 2: I remember now: the tool was XMLStarlet. I found it quite nice when I used it two years ago.

Boldewyn
Personally I've always preferred dom parsing ;-)
Dominic Rodger
Me too, but for really large XML files you'll be happy for every bit of performance you can squeeze from the tool.
Boldewyn
+1  A: 

I always recommend the XMLStarlet command-line utilities.

They provide validation, querying, formatting, editing of documents straight from the command line, and they're invaluable for this sort of work, or sanity-checking documents, chopping sections out via XPath etc.

Brian Agnew
Haha! You're too late. I remembered it just in time. ;-)
Boldewyn
It obviously made a big impression on you
Brian Agnew
Yes, three years ago, when I started with XML, it was a hassle to work with it on the command line. Then I found XML Starlet and suddenly it got really cool working with XML in my bash scripts.
Boldewyn