I have been using DOM for a long time, and its parsing performance has been pretty good. Even with XML of about 4-7 MB, parsing has been fast. The issue we face with DOM is the memory footprint, which becomes huge as soon as we start dealing with large XML documents.

Lately I tried moving to StAX (the Streaming API for XML), which is supposed to be a second-generation parser (what I read about StAX said it is the fastest parser now). When I tried a StAX parser on a large XML document of about 4 MB, the memory footprint definitely dropped drastically, but the time taken to parse the entire XML and create Java objects out of it increased almost 5 times over DOM.

I used the sjsxp.jar implementation of StAX.

I can deduce to some extent that performance may not be extremely good due to the streaming nature of the parser, but a 5x slowdown (e.g. DOM takes about 8 seconds to build objects for this XML, whereas StAX parsing took about 40 seconds on average) is definitely not going to be acceptable.

Am I missing some point here completely? I am not able to come to terms with these performance numbers.

A: 

Classic case of a speed/memory tradeoff, in my humble opinion. Not much you can do apart from trying SAX as well (or JDOM) and measuring again.

kazanaki
A: 

Try creating an XML file of 2000 MB and then compare the numbers. I guess a DOM-based approach will work faster on smaller data. StAX (or any SAX-based approach) will be the option as the data gets larger.

(We deal with files of 3 GB or larger. DOM does not even start the application.)

Jayan
+1  A: 

Although the question lacks some details, I am pretty sure the answer is that parsing is not what is slow in either case (DOM is not a parser; DOM trees are typically built using SAX or StAX parsers), but rather the code above it that creates objects.
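One way to check this on your own document is to time a bare StAX pass that only iterates over events and builds nothing. The following is a minimal sketch using only the JDK's `javax.xml.stream` API; the class name `ParseOnlyTiming`, the method `countStartElements`, and the generated sample document are made up for illustration and are not from the question.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ParseOnlyTiming {

    /** Walks every event without building any objects; returns the number of start tags seen. */
    static int countStartElements(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input; substitute the real document to get meaningful numbers.
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < 100_000; i++) {
            sb.append("<input><href>link").append(i).append("</href></input>");
        }
        String xml = sb.append("</root>").toString();

        long start = System.nanoTime();
        int n = countStartElements(xml);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " start elements in " + elapsedMs + " ms");
    }
}
```

If this pass takes a small fraction of the total time, the parser itself is probably not the bottleneck and the object-building code is the place to look. A single cold run includes JIT warm-up, so repeat the measurement a few times before drawing conclusions.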

There are efficient automatic data binders, including JAXB (and, with proper settings, XStream), which could help. They are faster than DOM because the main performance problem with DOM (and JDOM, Dom4j and XOM) is that tree models are inherently expensive compared to POJOs, especially regarding memory usage: they are basically glorified HashMaps, with lots of pointers for convenient untyped traversal.
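To illustrate the POJO point without pulling in a binding library, here is a hand-written sketch that maps StAX events straight onto plain objects. It is not JAXB or XStream, just the same idea done manually; the `Link` class, the `readLinks` method, and the `href` element name are assumptions chosen for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LinkBinder {

    /** Hypothetical POJO: one typed field instead of a generic tree node. */
    static final class Link {
        final String href;

        Link(String href) {
            this.href = href;
        }
    }

    /** Builds one Link per <href> element, assuming each <href> contains text only. */
    static List<Link> readLinks(String xml) {
        List<Link> links = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "href".equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag.
                        links.add(new Link(reader.getElementText()));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String xml = "<inputs><input><href>a</href></input><input><href>b</href></input></inputs>";
        for (Link link : readLinks(xml)) {
            System.out.println(link.href);
        }
    }
}
```

Only the data you actually need is kept in memory, which is roughly what a data binder automates for you.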

As to parsers, Woodstox is a faster StAX parser than Sjsxp, and Aalto is even faster if raw speed is of the essence. But I doubt the main issue here is parser speed.
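Before comparing implementations, it may help to confirm which one `XMLInputFactory.newInstance()` actually resolves to on your classpath. A small sketch using only JDK API follows; the class name `StaxImplCheck` and the method `implementationName` are hypothetical.

```java
import javax.xml.stream.XMLInputFactory;

public class StaxImplCheck {

    /** Returns the class name of the StAX factory the JDK lookup mechanism selected. */
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // With no third-party StAX jar on the classpath this is the JDK's built-in factory.
        System.out.println("StAX implementation: " + implementationName());
    }
}
```

If you add another implementation's jar and the printed class name does not change, the benchmark is still measuring the old parser.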

StaxMan
+1  A: 

package parsers;

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * @author Arthur Kushman
 */
public class DOMTest {

  public static void main(String[] args) {
    long time1 = System.currentTimeMillis();
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder db = dbf.newDocumentBuilder();
      Document doc = db.parse(new File("/Users/macpro/Desktop/myxml.xml"));
      doc.getDocumentElement().normalize();
      // System.out.println("Root Element: "+doc.getDocumentElement().getNodeName());
      NodeList nodeList = doc.getElementsByTagName("input");
      // System.out.println("Information of all elements in input");

      for (int s = 0; s < nodeList.getLength(); s++) {
        Node firstNode = nodeList.item(s);
        if (firstNode.getNodeType() == Node.ELEMENT_NODE) {
          Element firstElement = (Element) firstNode;
          NodeList firstNameElementList = firstElement.getElementsByTagName("href");
          Element firstNameElement = (Element) firstNameElementList.item(0);
          if (firstNameElement == null) {
            continue; // this <input> has no <href> child
          }
          NodeList firstName = firstNameElement.getChildNodes();
          // Read the first child (the text node). Indexing with s, as in the
          // first version, returns null as soon as s exceeds the child count.
          if (firstName.getLength() > 0) {
            System.out.println("First Name: " + firstName.item(0).getNodeValue());
          }
        }
      }
    } catch (Exception ex) {
      System.out.println(ex.getMessage());
      System.exit(1);
    }
    long time2 = System.currentTimeMillis() - time1;
    System.out.println(time2);
  }

}

45 ms

package parsers;

import javax.xml.stream.*;
import java.io.*;
import javax.xml.namespace.QName;

/**
 * @author Arthur Kushman
 */
public class StAXTest {

  public static void main(String[] args) throws Exception {
    long time1 = System.currentTimeMillis();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // factory.setXMLReporter(myXMLReporter);
    XMLStreamReader reader = factory.createXMLStreamReader(
        new FileInputStream(new File("/Users/macpro/Desktop/myxml.xml")));

    /*String encoding = reader.getEncoding();

    System.out.println("Encoding: "+encoding);

    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        QName element = reader.getName();
        // String text = reader.getText();
        System.out.println("Element: "+element);
        // while (event != XMLStreamConstants.END_ELEMENT) {
          System.out.println("Text: "+reader.getLocalName());
        // }
      }
    }*/

    try {
      int inElement = 0; // nesting depth inside <href> elements
      for (int event = reader.next(); event != XMLStreamConstants.END_DOCUMENT; event = reader.next()) {
        switch (event) {
          case XMLStreamConstants.START_ELEMENT:
            if (isElement(reader.getLocalName(), "href")) {
              inElement++;
            }
            break;
          case XMLStreamConstants.END_ELEMENT:
            if (isElement(reader.getLocalName(), "href")) {
              inElement--;
              if (inElement == 0) System.out.println();
            }
            break;
          case XMLStreamConstants.CHARACTERS:
            if (inElement > 0) System.out.println(reader.getText());
            break;
          case XMLStreamConstants.CDATA:
            if (inElement > 0) System.out.println(reader.getText());
            break;
        }
      }
      reader.close();
    } catch (XMLStreamException ex) {
      System.out.println(ex.getMessage());
      System.exit(1);
    }
    // System.out.println(System.currentTimeMillis());
    long time2 = System.currentTimeMillis() - time1;
    System.out.println(time2);
  }

  public static boolean isElement(String name, String element) {
    return name.equals(element);
  }

}

23 ms

StAX wins =)

Arthur Kushman
Thanks for the detailed analysis.
Fazal