At work I parse large XML files using the SAX DefaultHandler class. While doing so, I noticed that this API allocates many Strings: for element names, attribute names and values, and so on.
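
To make the observation concrete, here is a minimal sketch of a DefaultHandler subclass. The class name SaxCallbackDemo and its counters are made up for illustration. It only shows the shape of the callbacks: element names are handed over as String parameters, while text content arrives as a slice of a char array. Whether a given parser allocates fresh Strings for every call is implementation-dependent.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
public class SaxCallbackDemo extends DefaultHandler {
  public int elements;   // number of start tags seen
  public int textChars;  // total number of text characters reported

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes attributes) {
    // Names arrive as String objects; attribute names and values are
    // likewise exposed as Strings through the Attributes interface.
    elements++;
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Text is passed as a slice of a char array; no String is created
    // here unless the handler builds one itself.
    textChars += length;
  }

  public static SaxCallbackDemo parse(String xml) throws Exception {
    SaxCallbackDemo handler = new SaxCallbackDemo();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler;
  }
}
```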

That made me think about writing an XML parser that does only the absolute minimum of object allocation. So far I need:

  • one StringBuilder for building the element names, attribute names, etc.
  • one CharsetDecoder for transforming bytes into chars.
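
The two reusable objects listed above could be combined roughly as follows. This is only a sketch under assumptions: the class name ReusableDecoder, the fixed UTF-8 charset and the 64-char scratch buffer are my own choices, not part of any existing parser. It uses the three-argument CharsetDecoder.decode so that no new CharBuffer is created per call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes bytes into one shared StringBuilder.
public class ReusableDecoder {
  // Assumption: input is UTF-8; malformed bytes are replaced, not reported.
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE);
  private final CharBuffer scratch = CharBuffer.allocate(64);
  private final StringBuilder text = new StringBuilder();

  /**
   * Decodes the remaining bytes into the shared StringBuilder and returns it.
   * The returned builder is overwritten by the next call.
   */
  public StringBuilder decode(ByteBuffer bytes) {
    text.setLength(0);   // reuse the builder instead of creating a new one
    decoder.reset();     // reuse the decoder for a fresh byte sequence
    CoderResult result;
    do {
      scratch.clear();
      result = decoder.decode(bytes, scratch, true);
      scratch.flip();
      text.append(scratch);        // copies the decoded chars into the builder
    } while (result.isOverflow()); // scratch was full, more input remains
    scratch.clear();
    decoder.flush(scratch);        // emits any pending output of the decoder
    scratch.flip();
    text.append(scratch);
    return text;
  }
}
```

A caller that only needs to compare the decoded name against a known tag can inspect the StringBuilder directly and call toString() only when a String is really required.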

My test program, for parsing http://magnatune.com/info/song_info.xml, looks like this:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XmlParserDemo {
  public static void main(String[] args) throws IOException {
    List<Map<String, String>> allSongs = new ArrayList<Map<String, String>>();

    InputStream fis = new FileInputStream("d:/song_info.xml");
    try {
      XmlParser parser = new XmlParser(new BufferedInputStream(fis));
      if (parser.element("AllSongs")) {
        while (parser.element("Track")) {
          Map<String, String> track = new LinkedHashMap<String, String>();
          while (parser.element()) {
            String name = parser.getElementName();
            String value = parser.text();
            track.put(name, value);
            parser.endElement();
          }
          allSongs.add(track);
          parser.endElement();
        }
        parser.endElement();
      }
    } finally {
      fis.close();
    }
  }
}

This code looks better than my experiments with the XMLEventReader. Now the only missing part is the XmlParser class used in the code above. Do you know whether someone has already written such a parser? It's really just a pet project of mine, but I'm curious how much truth is left in the old statement "Object creation is expensive".

Yes, I know that LinkedHashMaps use a lot of memory. Only the parsing part needs to be memory-efficient; everything else is just there to keep the example simple.

A:

"Object creation is expensive" hasn't been true for quite a long time in Java. Allocation is usually dirt cheap (it amounts to moving a pointer) and garbage collection has come a long way.

I would definitely use an XML API which lets you do what you want easily rather than worrying too much about excessive memory allocation, unless you think you're going to be pushing your performance boundaries.

I'm sure there are XML APIs designed to have a particularly small memory footprint - but just how large are your XML files? If they're small enough to fit into memory easily, I'd just not worry about it... and if they're too large for that, you really need to be thinking of a streaming API anyway. I suspect the band of file sizes where a particularly efficient parser could fit the document in memory but a "normal" one couldn't is relatively narrow, so the gain would rarely apply.

Jon Skeet
You convinced me. I read the paper about StAX (at http://java.sun.com/performance/reference/whitepapers/StAX-1_0.pdf) and tried it, and as long as I don't call any unnecessary methods, it's as memory-efficient as I could ever want it.
Roland Illig
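
For comparison with the XmlParserDemo loop in the question, the same traversal could look roughly like this with the StAX cursor API (XMLStreamReader). This is a sketch, not the code Roland actually wrote: the class name StaxSongReader is made up, and it assumes every child of a Track element contains only text. Note that getLocalName() and getElementText() hand back Strings, which is acceptable here because the values end up in a Map anyway.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical StAX counterpart of XmlParserDemo.
public class StaxSongReader {
  public static List<Map<String, String>> readTracks(Reader in)
      throws XMLStreamException {
    List<Map<String, String>> allSongs = new ArrayList<>();
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(in);
    try {
      Map<String, String> track = null;  // non-null while inside a <Track>
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = reader.getLocalName();
          if (name.equals("Track")) {
            track = new LinkedHashMap<>();
          } else if (track != null) {
            // Assumption: children of <Track> are text-only elements.
            // getElementText() also consumes the matching end tag.
            track.put(name, reader.getElementText());
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
            && reader.getLocalName().equals("Track")) {
          allSongs.add(track);
          track = null;
        }
      }
    } finally {
      reader.close();  // releases parser resources, not the underlying Reader
    }
    return allSongs;
  }
}
```

When reading from a file, the createXMLStreamReader(InputStream) overload may be the better choice, since it lets the parser detect the encoding from the XML declaration.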