views: 52
answers: 1
I am trying to create an object tree from a large number of XML files. However, when I run the following code on about 2000 XML files (ranging from 100KB to 200MB), I get a large memory footprint of 8-9GB (note that I have commented out the code that creates the object tree). I expect the memory footprint to be minimal in this example because the code doesn't hold any references; it just creates each Elem and throws it away. The heap memory stays the same after running a full GC.

import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.xml.XML

def addDir(dir: File) {
  dir.listFiles.filter(file => file.getName.endsWith("xml.gz")).foreach { gzipFile =>
    addGzipFile(gzipFile)
  }
}

def addGzipFile(gzipFile: File) {
  val is = new BufferedInputStream(new GZIPInputStream(new FileInputStream(gzipFile)))
  val xml = XML.load(is)
  // parse xml and create object tree
  is.close()
}

My JVM options are: -server -d64 -Xmx16G -Xss16M -XX:+DoEscapeAnalysis -XX:+UseCompressedOops

And the output of jmap -histo looks like this

num     #instances         #bytes  class name
----------------------------------------------
   1:      67501390     1620033360  scala.collection.immutable.$colon$colon
   2:      37249187     1254400536  [C
   3:      37287806     1193209792  java.lang.String
   4:      37200976      595215616  scala.xml.Text
   5:      18600485      595215520  scala.xml.Elem
   6:       3420921       82102104  scala.Tuple2
   7:        213938       58213240  [I
   8:       1140334       36490688  scala.collection.mutable.ListBuffer
   9:       2280468       36487488  scala.runtime.ObjectRef
  10:       1140213       36486816  scala.collection.Iterator$$anon$24
  11:       1140210       36486720  scala.xml.parsing.FactoryAdapter$$anonfun$startElement$1
  12:       1140210       27365040  scala.collection.immutable.Range$$anon$2
...
Total     213412869     5693850736
+2  A: 

I cannot reproduce this behavior. I use the following program:

import java.io._
import xml.XML

object XMLLoadHeap {

  val filename = "test.xml"

  def addFile() {
    val is = new BufferedInputStream(new FileInputStream(filename))
    val xml = XML.load(is)
    is.close()
    println(xml.label)
  }

  def createXMLFile() {
    val out = new FileWriter(filename)
    out.write("<foo>\n")
    (1 to 100000) foreach (i => out.write("  <bar baz=\"boom\"/>\n"))
    out.write("</foo>\n")
    out.close()
  }

  def main(args: Array[String]) {
    println("XMLLoadHeap")
    createXMLFile()
    (1 to args(0).toInt) foreach { i => 
      println("processing " + i)
      addFile()
    }
  }

}

I run it with the following options: -Xmx128m -XX:+HeapDumpOnOutOfMemoryError -verbose:gc, and it looks like it can run indefinitely.

You can check whether it does this when using only your largest XML file. The issue may not be processing many files, but just processing the biggest one. When testing here with a dummy 200MB XML file on a 64-bit machine, I see that I need around 3GB of memory. If that's the case, you may need to use a pull parser. See XMLEventReader.
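For illustration, here is a minimal streaming sketch with XMLEventReader (a hypothetical standalone program; it reads the test.xml file generated by the program above and just counts bar elements instead of building a tree):

import scala.io.Source
import scala.xml.pull._

object XMLStreamExample {
  def main(args: Array[String]) {
    // Stream events one at a time instead of building the whole tree
    // with XML.load, so memory use stays flat regardless of file size.
    val reader = new XMLEventReader(Source.fromFile("test.xml"))
    var bars = 0
    for (event <- reader) event match {
      case EvElemStart(_, "bar", attrs, _) =>
        // Attributes are available as the element streams past;
        // nothing is retained after this case returns.
        bars += 1
      case _ => // ignore text, end tags, comments, etc.
    }
    println("saw " + bars + " bar elements")
  }
}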

Other than that, assuming you don't create the object tree, you can run with -Xmx4G -XX:+HeapDumpOnOutOfMemoryError and then analyze the heap dump with a tool like MAT. 4GB should be sufficient to parse the largest XML file, and by the time you get an out-of-memory error there may be enough objects allocated to pinpoint which object is preventing GC. Most likely that will be an object holding on to the various parsed XML objects.
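For reference, a heap dump for MAT can also be taken on demand with jmap, the same tool that produced the histogram above:

jmap -dump:format=b,file=heap.hprof <pid>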

huynhjl
Ran the program (from the Scala console, so that the VM stays alive) for the single largest XML file (438MB). Took a heap summary after loading the file and running a full GC. The heap usage doesn't seem to be the problem, as only 111MB of the old generation (and none of the young generation) is being used. However, the output of the `top` command shows a resident (RES) size of 4.8GB.
Sachin Kanekar
On the other hand, running with a 32-bit (3GB) heap throws:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.xml.parsing.FactoryAdapter.startElement(FactoryAdapter.scala:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1363)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next...
Sachin Kanekar
@Sachin Is that parsing a single file? I wouldn't be too worried about the RES stat from top. If you use -Xmx16G, I would expect the JVM to make some use of it. It seems Scala is fairly memory hungry when representing XML, but I don't think it's holding onto references unnecessarily. Depending on what you're trying to do, using XML.load may not be the right approach.
huynhjl
@huynhjl I have now switched to XMLEventReader, and could load 21GB of XMLs into an object tree within a 16GB heap. Two things that helped were 1) using string interning (see the sketch below) and 2) using CompressedOops.
Sachin Kanekar
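A minimal sketch of the interning idea, assuming the tree is built from a hypothetical TreeNode class (the actual object tree classes aren't shown in this thread):

// Hypothetical node type; field names are illustrative only.
case class TreeNode(label: String, attrs: Map[String, String], children: List[TreeNode])

def makeNode(label: String, attrs: Map[String, String], children: List[TreeNode]) =
  // String.intern() returns the canonical instance, so millions of
  // repeated tag names and attribute values are stored only once.
  TreeNode(label.intern(), attrs.map { case (k, v) => (k.intern(), v.intern()) }, children)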
@Sachin impressive, glad that things worked out.
huynhjl