In a Java program, I need to parse or tokenize XML while preserving the raw text (e.g. don't expand entities, don't normalize whitespace in attributes, keep attribute order, etc.).

I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?

edit: why am I doing this? -- I have a large XML file in which I want to make a small number of localized changes programmatically, and those changes need to be reviewed. Being able to use a diff tool for that review is highly valuable. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.

+2  A: 

I think you might have to generate your own grammar.

ykaganovich
I think you might be right, and it's painful. I don't know how to use ANTLR, and I can't even find a machine-readable BNF for XML. W3C has http://www.w3.org/TR/xml11/ but the BNF is interspersed with text. >:(
Jason S
It was painful, but as far as I can tell, I got it done OK, and in the future I can use the code I wrote.
Jason S
Glad it worked out for you. For anyone else needing to extract BNF from the XML spec, it can be scraped from the XML version (currently at http://www.w3.org/TR/xml11/REC-xml11-20060816.xml). Search for <scrap lang="ebnf"> elements.
ykaganovich
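The scraping suggestion above could look roughly like this in Java. This is only a sketch: `EbnfScraper` and `productions` are made-up names, and it assumes that each `<scrap lang="ebnf">` element holds `<prod>` elements with `<lhs>` and `<rhs>` children, which should be checked against the actual file. The feature URI is the Xerces one understood by the JDK's built-in parser and is there to stop the parser from fetching the spec's external DTD. The fragment in `main` is invented for illustration, not copied from the spec.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EbnfScraper {

    /** Returns one "lhs ::= rhs" line per <prod> found inside <scrap lang="ebnf">. */
    static List<String> productions(String specXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the default JDK (Xerces-based) parser, which honours this feature.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(specXml)));

            List<String> out = new ArrayList<>();
            NodeList scraps = doc.getElementsByTagName("scrap");
            for (int i = 0; i < scraps.getLength(); i++) {
                Element scrap = (Element) scraps.item(i);
                if (!"ebnf".equals(scrap.getAttribute("lang"))) {
                    continue; // skip scraps that are not EBNF
                }
                NodeList prods = scrap.getElementsByTagName("prod");
                for (int j = 0; j < prods.getLength(); j++) {
                    Element prod = (Element) prods.item(j);
                    out.add(text(prod, "lhs") + " ::= " + text(prod, "rhs"));
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not read spec XML", e);
        }
    }

    /** Joins the text of all descendants named 'tag' and collapses runs of whitespace. */
    private static String text(Element parent, String tag) {
        StringBuilder sb = new StringBuilder();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            sb.append(nodes.item(i).getTextContent()).append(' ');
        }
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Invented fragment in the assumed layout, not the real spec text.
        String fragment =
            "<spec><scrap lang=\"ebnf\"><head>Example</head>"
          + "<prod id=\"NT-example\"><lhs>example</lhs>"
          + "<rhs><nt def=\"NT-a\">a</nt> <nt def=\"NT-b\">b</nt>*</rhs></prod>"
          + "</scrap></spec>";
        productions(fragment).forEach(System.out::println);
    }
}
```

To run it against the real spec, download the XML version and pass its contents to `productions` as a string. The output is a flat list of rules and would still need hand editing before a parser generator such as ANTLR could use it.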
Please be a good citizen and elaborate on how you solved it so that everybody can learn from your experience.
Martin Spamer
@Martin: I used "pure Java", no libraries, didn't use BNF, and wrote a tokenizer to parse XML in a way that preserves the original text for each element.
Jason S
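For readers who want a starting point, here is a minimal sketch of the kind of tokenizer described in the comment above. It is not Jason S's code, which was never posted; the class name `RawXmlTokenizer`, the `Kind` enum and the helper names are all hypothetical. The idea is to split the input into markup and text tokens without interpreting either, so that concatenating the raw text of the tokens reproduces the input exactly. It does not check well-formedness, and a DOCTYPE internal subset whose comments contain quote characters will confuse it.

```java
import java.util.ArrayList;
import java.util.List;

class RawXmlTokenizer {

    enum Kind { TEXT, TAG, COMMENT, CDATA, PI, DECL }

    /** A token keeps the exact source characters; nothing is decoded or normalized. */
    static final class Token {
        final Kind kind;
        final String raw;
        Token(Kind kind, String raw) { this.kind = kind; this.raw = raw; }
        @Override public String toString() { return kind + "[" + raw + "]"; }
    }

    static List<Token> tokenize(String xml) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < xml.length()) {
            Kind kind;
            int end; // index just past the current token
            if (xml.charAt(pos) != '<') {
                kind = Kind.TEXT; // entities such as &amp; stay as written
                end = xml.indexOf('<', pos);
                if (end < 0) end = xml.length();
            } else if (xml.startsWith("<!--", pos)) {
                kind = Kind.COMMENT;
                end = endAfter(xml, "-->", pos + 4);
            } else if (xml.startsWith("<![CDATA[", pos)) {
                kind = Kind.CDATA;
                end = endAfter(xml, "]]>", pos + 9);
            } else if (xml.startsWith("<?", pos)) {
                kind = Kind.PI;
                end = endAfter(xml, "?>", pos + 2);
            } else {
                kind = xml.startsWith("<!", pos) ? Kind.DECL : Kind.TAG;
                end = endOfMarkup(xml, pos + 1);
            }
            tokens.add(new Token(kind, xml.substring(pos, end)));
            pos = end;
        }
        return tokens;
    }

    /** Index just past the first 'terminator' at or after 'from', or end of input if absent. */
    private static int endAfter(String xml, String terminator, int from) {
        int i = xml.indexOf(terminator, from);
        return i < 0 ? xml.length() : i + terminator.length();
    }

    /** Index just past the closing '>', ignoring '>' inside quotes or inside [ ... ]. */
    private static int endOfMarkup(String xml, int from) {
        char quote = 0; // the open quote character, or 0 when outside quotes
        int depth = 0;  // nesting of '[' for a DOCTYPE internal subset
        for (int i = from; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '[') {
                depth++;
            } else if (c == ']') {
                depth--;
            } else if (c == '>' && depth <= 0) {
                return i + 1;
            }
        }
        return xml.length(); // unterminated markup: take the rest of the input
    }

    public static void main(String[] args) {
        String xml = "<a  z='1 &amp; 2'\n   b=\"x>y\"><!-- c --><![CDATA[<raw>]]>t&lt;</a>";
        StringBuilder rebuilt = new StringBuilder();
        for (Token t : tokenize(xml)) {
            System.out.println(t);
            rebuilt.append(t.raw);
        }
        // The round trip is the property that keeps diffs small.
        System.out.println("round trip ok: " + rebuilt.toString().equals(xml));
    }
}
```

A localized change then means replacing the `raw` string of the few tokens you care about and writing all tokens back out in order. Every untouched token is emitted character for character, so attribute order, quoting style and whitespace survive and the diff shows only the intended edits.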
+2  A: 

I don't think any XML parser will do what you want. Why? For instance, the XML spec doesn't enforce attribute ordering, so a parser is free to discard it. I think you're going to have to parse it yourself, and that is non-trivial.

Why do you have to do this? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous hoops to try and accommodate this.

Brian Agnew
A: 

I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?

Also, relying on attribute order is not a good idea: if I remember the XML standard correctly, the order of attributes is not significant.

It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.

Martin Skøtt