+2  Q: 

RE: Big XML file

Follow-up question to Big XML File:

First, thanks a lot for your answers. After… what am I doing wrong? This is my class, which uses SAX:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXParserXML extends DefaultHandler {

    public static void ParcourXML() {
        DefaultHandler handler = new SAXParserXML();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            String URI = "dblp.xml";
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(URI, handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

    // Both callbacks are intentionally empty: nothing is done with the parsed content.
    public void startElement(String namespaceURI, String simpleName, String qualifiedName, Attributes attrs) throws SAXException {
    }

    public void endElement(String namespaceURI, String simpleName, String qualifiedName) throws SAXException {
    }
}

As you can see, I do nothing with the XML file, yet it throws this error:

java.lang.OutOfMemoryError: Java heap space
    at com.sun.org.apache.xerces.internal.util.XMLStringBuffer.append(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.refresh(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.invokeListeners(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at SAXParserXML.ParcourXML(SAXParserXML.java:30)
    at Main.main(Main.java:28)

I also tried StAX and got the same error. What can I do? I also increased the Java heap size to 1260M:

java -Xmx1260M SAXParserXML

The XML file has this form:

<dblp> 
   <incollection> 
      <author>... </author> 
      .... 
      <author>... </author> 
      # other tags; I'm only interested in <author> #
      ... 
   </incollection> 
   <incollection> 
   # the same thing# 
   </incollection> 
   .... 
</dblp>

You can find the original file at http://dblp.uni-trier.de/xml/

Thanks

+5  A: 

There's a bug reported against Java 1.6 that shows the exact same stack trace, and it is unfixed as of now. Newer Xerces versions seem to be fine.

For documents this large that still contain a fair amount of structure, you could consider pull parsing, i.e. parsing partial structures on demand, for instance with StAX.
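
To illustrate the pull-parsing idea, here is a minimal sketch using the JDK's built-in StAX API (javax.xml.stream). The class name StaxAuthorSketch, the helper readAuthors and the tiny inline sample document are made up for this example; it has not been run against the real dblp.xml, and it says nothing about whether the Java 6 bug is avoided.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the question's code.
class StaxAuthorSketch {

    /** Pulls events one at a time and collects the text of every <author> element. */
    static List<String> readAuthors(XMLStreamReader reader) throws XMLStreamException {
        List<String> authors = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "author".equals(reader.getLocalName())) {
                // getElementText() reads the text up to the matching end tag
                authors.add(reader.getElementText().trim());
            }
        }
        return authors;
    }

    /** Convenience wrapper for an in-memory document (assumption: small sample input). */
    static List<String> readAuthors(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            return readAuthors(reader);
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample = "<dblp><incollection><author>A. One</author><author>B. Two</author>"
                + "<title>T</title></incollection></dblp>";
        System.out.println(readAuthors(sample));
    }
}
```

For a file on disk, the same loop should work with a reader created from a FileInputStream instead of a StringReader. In a real run you would probably process each author as it is read rather than collect them all in a list.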

Torsten Marek
I have already tried StAX; it gives the same error.
If it gives the same stack trace, you aren't using StAX. What is the stack trace for StAX?
lavinio
A: 

There seems to be a problem with HTML entities in your file, namely "Jos&eacute;" in the first block. At least my browser tells me there's a problem with it when I open the file, and XMLEntityScanner shows up in the stack trace. I'm not an XML expert, but could it be that HTML entities are not in fact defined for XML in general?

Edit: Yup, that's it. According to Wikipedia, entities like &eacute; are defined in the HTML DTD; XML itself has only a very small number of predefined entities (&lt;, &gt;, &amp;, &apos; and &quot;).

Michael Borgwardt
All entities are defined in dblp.dtd
But would that cause a memory error? I'm not an XML expert either, but I would think that bad entities like &eacute; would cause SAXExceptions as opposed to memory exceptions.
Michael Angstadt
That answer is of no benefit whatsoever to this question ...
mark
A: 

I don't know the correct terminology for this, but how "deep" does your XML go? For example, the "author" tag in your example is 2 elements deep. If you have tags that are really really deep, maybe that's why you're having memory issues?

Michael Angstadt
the deepest level is 2
Nesting really should not matter: amount of memory used per level is very small for both SAX and Stax. I mean, not unless it's tens of thousands of levels or so. :)
StaxMan
+2  A: 

Well, given:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // Use an Echo instance as the SAX event handler
        Echo handler = new Echo();
        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Parse the input
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(new File("/tmp/dblp.xml"), handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.out.println("Incollections = " + handler.cnt);
        System.exit(0);
    }

    // Counts the <incollection> start tags seen during the parse
    static class Echo extends DefaultHandler {
        public int cnt = 0;

        @Override
        public void startElement(String namespaceURI,
                String sName, // simple name
                String qName, // qualified name
                Attributes attrs)
                throws SAXException {
            if (qName.equals("incollection")) {
                cnt = cnt + 1;
            }
        }
    }
}

This works for me under Java 5, but I do get the OOM under Java 6.

I run it like this:

java -DentityExpansionLimit=512000 -jar xmltest.jar

And it prints:

Incollections = 8353

Which is convenient:

grep "<incollection" /tmp/dblp.xml | wc -l
8353

So, FYI, data point, etc.
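
If passing -D on the command line is inconvenient, the same limit can presumably be raised from code before the parser is created. The following is only a sketch under assumptions: the class EntityLimitSketch and its inline sample document are invented for illustration, and which property name is honored (the legacy entityExpansionLimit or the newer jdk.xml.entityExpansionLimit) depends on the JDK version, so both are set here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; it mirrors the counting handler above on a tiny input.
class EntityLimitSketch {

    /** Counts <incollection> start tags. */
    static class Counter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("incollection".equals(qName)) {
                count++;
            }
        }
    }

    static int countIncollections(InputSource source) throws Exception {
        // Assumption: the limit is read when the factory/parser is created,
        // so the properties are set first. Unknown property names are simply ignored.
        System.setProperty("entityExpansionLimit", "512000");
        System.setProperty("jdk.xml.entityExpansionLimit", "512000");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Counter counter = new Counter();
        parser.parse(source, counter);
        return counter.count;
    }

    public static void main(String[] args) throws Exception {
        // Small stand-in for dblp.xml with one entity declared in an internal DTD subset
        String sample = "<!DOCTYPE dblp [<!ENTITY eacute \"&#233;\">]>"
                + "<dblp><incollection><author>Jos&eacute;</author></incollection>"
                + "<incollection/></dblp>";
        System.out.println("Incollections = "
                + countIncollections(new InputSource(new StringReader(sample))));
    }
}
```

The command-line flag remains the simpler option when you control how the JVM is started.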

Will Hartung
Thanks a lot. That was the problem: I had to compile with Java 5 and raise the entity expansion limit: java -DentityExpansionLimit=512000 Main
A: 

It sounds like one of the text segments (or a CDATA section, processing instruction, or comment) in the XML file is very long, and the parser does not split it into multiple segments. Or it could be that the parser fails to parse the DOCTYPE declaration properly: if so, it might try to read all of the XML content as if it were part of the DTD subset.

But that's just speculation. You mentioned that you have tried StAX: which implementation? JDK 1.6 comes with Sun's SJSXP. You could also try Woodstox (http://woodstox.codehaus.org), which often handles things in a somewhat more robust way, so if you are not using Woodstox, you could see what happens with it. It splits text segments into smaller chunks unless you force text coalescing (which is not the default).

Oh, and just in case you were testing with the StAX reference implementation (http://stax.codehaus.org): it is unfortunately known to be very buggy, so that could cause problems. Both SJSXP and Woodstox are much better choices for StAX.
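
On the point about text arriving in chunks: SAX also allows a parser to deliver one run of text through several characters() calls, so a handler that wants the <author> values has to buffer them itself. Here is a minimal sketch of that, using only the JDK's SAX classes. The class name AuthorTextSketch and the inline sample are invented for illustration, and it does not address the out-of-memory problem as such.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example handler: collects the text of each <author> element.
class AuthorTextSketch extends DefaultHandler {
    final List<String> authors = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <author> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("author".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append rather than assign
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("author".equals(qName) && current != null) {
            authors.add(current.toString());
            current = null; // stop buffering text outside <author>
        }
    }

    static List<String> parse(String xml) throws Exception {
        AuthorTextSketch handler = new AuthorTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.authors;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<dblp><incollection><author>Jos&#233; Example</author>"
                + "<title>ignored</title></incollection></dblp>";
        System.out.println(parse(sample));
    }
}
```

Because the buffer only exists between an <author> start tag and its end tag, text in all other elements is dropped as soon as it is reported, which keeps the handler's own memory use small.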

StaxMan