I'm parsing an XML document into my own structure, but building it is very slow for large inputs. Is there a better way to do it?

public static DomTree<String> createTreeInstance(String path) 
  throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = docBuilderFactory.newDocumentBuilder();
    File f = new File(path);
    Document doc = db.parse(f);       
    Node node = doc.getDocumentElement(); 
    DomTree<String> tree = new DomTree<String>(node);
    return tree;
}

Here is my DomTree constructor:

    /**
     * Recursively builds a tree structure from a DOM object.
     * @param root
     */
    public DomTree(Node root){   
        node = root;     
        NodeList children = root.getChildNodes();
        DomTree<String> child = null;
        for(int i = 0; i < children.getLength(); i++){  
            child = new DomTree<String>(children.item(i));
            if (children.item(i).getNodeType() != Node.TEXT_NODE){
                super.children.add(child);
            }
        }
    }

UPDATE:

I have benchmarked the createTreeInstance() method using a 100MB XML file:

  • Creating docBuilderFactory... Done [3ms]
  • Creating docBuilder... Done [21ms]
  • parsing file... Done [5646ms]
  • getDocumentElement... Done [1ms]
  • creating DomTree... Done [17076ms]

UPDATE:

As John Doe suggests below, it may be more appropriate to use SAX. I have never used SAX before, so is there a good way to convert what I have to use SAX?

A: 

Have you tried profiling this? I think that may be more instructive than looking at the code, because a bottleneck often shows up where you'd never expect it. A simple profile (that you can do trivially in code) is to time the DOM parsing against your tree building.
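A minimal sketch of that in-code timing, assuming a small in-memory document in place of the real file. The names `PhaseTimer`, `timed` and `countNodes` are hypothetical, and `countNodes` is only a stand-in for the `DomTree` construction, which is not fully shown in the question:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class PhaseTimer {
    /** Runs one phase, prints its elapsed wall-clock time, and returns its result. */
    static <T> T timed(String label, Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T result = phase.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + "... Done [" + elapsedMs + "ms]");
        return result;
    }

    /** Stand-in for the tree-building step: visits every node via sibling links. */
    static int countNodes(Node node) {
        int count = 1; // the node itself
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countNodes(c);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a tiny inline document; swap in the real file to measure it.
        String xml = "<a><b>text</b><c/></a>";
        DocumentBuilder db = timed("Creating docBuilder",
            () -> DocumentBuilderFactory.newInstance().newDocumentBuilder());
        Document doc = timed("parsing file",
            () -> db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
        timed("creating DomTree", () -> countNodes(doc.getDocumentElement()));
    }
}
```

A single run like this includes JIT warm-up and garbage collection pauses, so it only gives a rough split between phases; a profiler gives a more reliable picture.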

For more in-depth profiling, JProfiler is available as an evaluation copy. Others may be able to recommend something more appropriate.

Brian Agnew
I've only benchmarked the larger program that is using it, and it shows that this process is a bottleneck
Robert
So I'd certainly look at the DOM parsing vs. your tree building
Brian Agnew
Creating docBuilderFactory... Done [3ms]; Creating docBuilder... Done [21ms]; parsing file... Done [5646ms]; getDocumentElement... Done [1ms]; creating DomTree... Done [17076ms]
Robert
If you're loading a 100MB doc, then memory may be an issue. Try increasing the VM max heap size using -Xmx512m (to allow up to 512 MB, or use whatever figure you can)
Brian Agnew
Actually, it's already set to -Xms2g -Xmx2g
Robert
Ah. Just removed that from my answer. Thx
Brian Agnew
+2  A: 

If you're parsing a large XML document, don't use DOM; use SAX, a pull parser such as XPP3, or another streaming API.

The catch is that you won't have an "XML tree" in memory, which might have been convenient; you only get events and deal with them as they arrive. However, it is memory-efficient, and you can map elements directly to your own data structures.
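As a rough sketch of what event handling looks like with the JDK's built-in SAX parser: the handler below keeps a stack of open elements and attaches each new element to its parent. `SaxTreeBuilder` and `ElementNode` are hypothetical names, with `ElementNode` standing in for a structure like `DomTree`. This version assumes only elements are wanted, so text, comments and processing instructions are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTreeBuilder extends DefaultHandler {
    /** Hypothetical stand-in for DomTree: an element name plus its child elements. */
    static class ElementNode {
        final String name;
        final List<ElementNode> children = new ArrayList<>();
        ElementNode(String name) { this.name = name; }
    }

    private final Deque<ElementNode> open = new ArrayDeque<>(); // currently open elements
    private ElementNode root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        ElementNode node = new ElementNode(qName);
        if (open.isEmpty()) {
            root = node;                    // first element seen is the document root
        } else {
            open.peek().children.add(node); // attach to the innermost open element
        }
        open.push(node);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop(); // this element is complete
    }

    // characters() keeps its no-op default from DefaultHandler, so text is skipped.

    static ElementNode parse(InputStream in) throws Exception {
        SaxTreeBuilder handler = new SaxTreeBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.root;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a tiny inline document; pass a FileInputStream for a real file.
        String xml = "<a><b>text</b><c/></a>";
        ElementNode root = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(root.name + " has " + root.children.size() + " element children");
    }
}
```

Because the custom tree is built directly from the events, no intermediate DOM `Document` is ever held in memory. If the tree itself is not needed, the handler can process each element in `endElement` and discard it instead.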

John Doe
do you have an example?
Robert