Hi,

I am currently modifying a piece of code, and I am wondering whether the way the XML is formatted (tabs and spacing) affects how it is parsed by a parser obtained from the DocumentBuilderFactory class.

In essence, the question is: can I pass one long string with no spacing to the parser, or does it need to be formatted in some way?

Thanks in advance. Included below is the class definition from Oracle's website.

Class DocumentBuilderFactory

"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "

A: 

It should not affect the ability of the parser as long as the string is valid XML. Tabs and newlines are stripped out or ignored by parsers and are really for the aesthetics of the human reader.

Note that you will have to pass an input stream (a ByteArrayInputStream, for example; StringBufferInputStream is deprecated) or an InputSource to the DocumentBuilder, because the String version of parse assumes its argument is a URI pointing to the XML.
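
As a rough illustration of the point about parse, here is a minimal sketch that feeds an unformatted, single-line string to a DocumentBuilder through an InputSource wrapping a StringReader. The class name ParseXmlString, the helper parse and the sample XML are assumptions made up for this example, not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseXmlString {

    /** Parses XML held in a String; parser failures are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the text in an InputSource: builder.parse(String) would
            // treat its argument as a URI, not as XML content.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // One long line with no indentation or line breaks (hypothetical data).
        String compact = "<order><item>book</item><item>pen</item></order>";
        Document doc = parse(compact);
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("item").getLength());
    }
}
```

Using a Reader rather than a byte stream sidesteps character-encoding questions, since the String is already decoded text.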

Adrian Regan
Thanks for the bit of information at the end, good to know.
Ross Alexander
This is not true. I've tested it. The DOM objects built from an XML string with line feeds and without them are different!
sarahTheButterFly
A: 

The format of the XML string shouldn't have any effect, but I remember a strange problem when I passed a long string to an XML parser: the parser was unable to parse an XML file because it was written as one long line.

It may be better to insert line breaks so that no line is longer than, let's say, 1000 bytes.

Sadly, I remember neither why that error occurred nor which parser I used.

ckuetbach
I think XML parsers ignore line feeds. It is DocumentBuilder that builds different DOM objects depending on whether the XML string contains line feeds.
sarahTheButterFly
You are right, but I remember a bug in an XML API or library that was unable to build the DOM because that particular implementation read only x bytes per line.
ckuetbach
A: 

The documents will be different. Tabs and newlines between elements are kept as text nodes. You can eliminate these by calling setIgnoringElementContentWhitespace(true) on DocumentBuilderFactory.

But in order for it to work you must set up your DOM parser to validate the content against a DTD or xml schema.
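
A minimal sketch of such a validating setup, assuming the DTD is declared inside the document itself. The class name IgnoreWhitespaceDemo, the helper countRootChildren and the sample document are made up for illustration; setValidating and setIgnoringElementContentWhitespace are the standard DocumentBuilderFactory methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IgnoreWhitespaceDemo {

    // Hypothetical document with an internal DTD. <root> may contain only
    // <child> elements, so the whitespace between them is ignorable.
    static final String XML =
            "<!DOCTYPE root [\n"
            + "  <!ELEMENT root (child+)>\n"
            + "  <!ELEMENT child (#PCDATA)>\n"
            + "]>\n"
            + "<root>\n"
            + "  <child>a</child>\n"
            + "  <child>b</child>\n"
            + "</root>";

    /** Returns how many child nodes <root> has after parsing. */
    static int countRootChildren(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The whitespace setting only takes effect when the parser validates.
            factory.setValidating(ignoreWhitespace);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Default settings: whitespace text nodes surround each <child>.
        System.out.println(countRootChildren(false)); // prints 5
        // Validating and ignoring element-content whitespace: elements only.
        System.out.println(countRootChildren(true));  // prints 2
    }
}
```

Whitespace inside an element declared as #PCDATA is never dropped by this setting, because there it counts as real content.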

Alternatively you could programmatically remove the extra whitespace yourself using something like the following:

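import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Recursively removes text nodes that contain only whitespace
 * from the subtree rooted at the given node.
 */
public static void removeEmptyTextNodes(Node node) {
    NodeList nodeList = node.getChildNodes();
    Node childNode;
    // Iterate backwards so removals do not shift the indexes still to visit.
    for (int x = nodeList.getLength() - 1; x >= 0; x--) {
        childNode = nodeList.item(x);
        if (childNode.getNodeType() == Node.TEXT_NODE) {
            if (childNode.getNodeValue().trim().isEmpty()) {
                node.removeChild(childNode);
            }
        } else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
            removeEmptyTextNodes(childNode);
        }
    }
}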
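
To see what the method above does in practice, here is a small self-contained sketch that repeats it inside a demo class and compares child counts before and after the cleanup. The class name RemoveEmptyTextNodesDemo, the helpers parseRoot and rootChildCount, and the sample XML are assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RemoveEmptyTextNodesDemo {

    // Same logic as the method shown above, repeated so this sketch compiles alone.
    static void removeEmptyTextNodes(Node node) {
        NodeList nodeList = node.getChildNodes();
        for (int x = nodeList.getLength() - 1; x >= 0; x--) {
            Node childNode = nodeList.item(x);
            if (childNode.getNodeType() == Node.TEXT_NODE) {
                if (childNode.getNodeValue().trim().isEmpty()) {
                    node.removeChild(childNode);
                }
            } else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
                removeEmptyTextNodes(childNode);
            }
        }
    }

    /** Parses the XML text with default factory settings and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Counts the root's child nodes, optionally after stripping whitespace-only text. */
    static int rootChildCount(String xml, boolean cleanup) {
        Element root = parseRoot(xml);
        if (cleanup) {
            removeEmptyTextNodes(root);
        }
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        String pretty = "<root>\n\t<a>1</a>\n\t<b>2</b>\n</root>";
        String compact = "<root><a>1</a><b>2</b></root>";
        System.out.println(rootChildCount(pretty, false));  // prints 5
        System.out.println(rootChildCount(pretty, true));   // prints 2
        System.out.println(rootChildCount(compact, false)); // prints 2
    }
}
```

After the cleanup, the formatted and the single-line documents have the same number of children under the root, and the text inside <a> and <b> is left untouched because it is not blank.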
Blaise Doughan