I have code that creates an XML document that is difficult to read in a basic text editor. I tried using transformer.setOutputProperty(OutputKeys.INDENT, "yes"), which is much better, but now when I read the XML back in I get lots of annoying text nodes that weren't there before. All of these text nodes contain a newline character "\n". Is there any way to exclude them when I read the XML back in, without having to write code to find and remove them myself? Some sort of filter, maybe?

EDIT

I looked into Daniel's suggestion to use setIgnoringElementContentWhitespace(true) but ran into two problems:

  1. I have to put the DocumentBuilderFactory into validating mode.
  2. Validating mode requires a DTD. I don't have a DTD; the program I am creating allows the user to create new tags on the fly...

So to complicate things a bit more, is there a way to do this without a DTD? Or is there a simple way to create the DTD when I am saving the XML file?

A: 

An XSL transform would do the trick; this is exactly what XSL is for: manipulating XML files to present them in a different format. It would be very simple to filter out the offending nodes and pass everything else through untouched.
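
As a rough sketch of that idea (not necessarily the only or best stylesheet), the Java below runs an identity transform with one extra empty template that drops whitespace-only text nodes, and returns the result as a DOM Document. The class name WhitespaceFilter, the method load, and the sample XML are made up for illustration. It assumes the JDK's built-in XSLT 1.0 processor and needs no DTD. Note that it also drops whitespace-only text inside mixed content, which may or may not be what you want.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical helper: loads XML through an XSL filter instead of a plain DocumentBuilder.
final class WhitespaceFilter {

    // Identity transform plus one empty template that swallows
    // text nodes consisting only of whitespace.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='text()[not(normalize-space())]'/>"
        + "</xsl:stylesheet>";

    // Parses the XML string and returns a DOM without whitespace-only text nodes.
    static Document load(String xml) throws TransformerException {
        Transformer filter = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        DOMResult result = new DOMResult();
        filter.transform(new StreamSource(new StringReader(xml)), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws TransformerException {
        // Sample input shaped like the output of an indenting Transformer.
        String indented = "<root>\n  <item>a</item>\n  <item>b</item>\n</root>";
        Document doc = load(indented);
        // Prints how many child nodes <root> has after filtering.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

With the sample above, only the two item elements should remain as children of root; the indentation text nodes between them are filtered out while the "a" and "b" text is kept.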

Whatever you do, do NOT try to parse XML with regular expressions. XML is not a regular language, and parsing it with regular expressions is a road that leads to madness and, worse, to buggy, brittle code.

fuzzy lollipop
A: 

AFAIK most XML parsers have an option to skip empty text nodes, since they occur all the time. Xerces does, at least. The feature is called

http://apache.org/xml/features/dom/include-ignorable-whitespace

and it can be disabled (it's enabled by default, if I read it right). Description:

True:       Includes text nodes that can be considered "ignorable whitespace" in the DOM tree. 
False:      Does not include ignorable whitespace in the DOM tree. 
Default:    true 
Note:       The only way that the parser can determine if text is ignorable
            is by reading the associated grammar and having a content model
            for the document. When ignorable whitespace text nodes are included
            in the DOM tree, they will be flagged as ignorable. The ignorable 
            flag can be queried by calling the
            TextImpl#isIgnorableWhitespace():boolean method.  
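
A minimal sketch of switching the feature off through JAXP, under some assumptions: the parser behind DocumentBuilderFactory is Xerces or the Xerces-derived one bundled with the JDK (otherwise setFeature may throw a ParserConfigurationException), and the document carries a grammar, here a small internal DTD invented for the demo, because of the note above. The class name IgnorableWhitespaceDemo and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes a Xerces-based parser behind JAXP.
final class IgnorableWhitespaceDemo {

    // The feature name from the Xerces documentation (a name, not a web page).
    static final String FEATURE =
        "http://apache.org/xml/features/dom/include-ignorable-whitespace";

    // Sample document with an internal DTD so the parser has a content model.
    static final String SAMPLE =
        "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (item*)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <item>a</item>\n  <item>b</item>\n</root>";

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Validating mode, so the grammar is read and applied.
        factory.setValidating(true);
        // Ask the parser to leave ignorable whitespace out of the DOM tree.
        factory.setFeature(FEATURE, false);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Prints how many child nodes <root> has.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Without the DOCTYPE the parser has no content model, so it cannot tell which whitespace is ignorable and the newline text nodes stay in the tree.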
Daniel
Got a 404 on your link...
BigMac66
It is not a link! It is the feature's name (congratulate Xerces for that). Search for that string to learn more.
Daniel