ansaurus

Question

RE: Big XML file

Answer 1

+5 A:

There's a bug filed against Java 1.6 that shows the exact same stack trace, and it's unfixed as of now. Newer Xerces versions seem to be fine.

For documents this large, which still contain a fair amount of structure, you could think about using pull-parsing, i.e. parsing of partial structures, for instance with StAX.
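As a rough illustration of the pull-parsing idea, here is a minimal sketch that counts elements with the standard `javax.xml.stream` API. The class name `StaxCounter` and the method `countElements` are made up for this example; the element name to look for (for instance `incollection`) is passed in by the caller. It assumes the document's DTD, if any, can be resolved by the parser.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class StaxCounter {
    // Streams through the document and counts start tags with the given local name.
    // Only the current event is held in memory, never the whole tree.
    static int countElements(InputStream in, String name) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close(); // releases parser resources; does not close the stream
        }
        return count;
    }
}
```

For a file on disk, the input would be something like `new FileInputStream("/tmp/dblp.xml")`, closed by the caller.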

Torsten Marek 2009-02-03 21:22:16

I have already tried StAX... it gives the same error.

2009-02-03 21:36:09

If it gives the same stack trace, you aren't using StAX. What is the stack trace for StAX?

lavinio 2009-11-11 20:38:06

Answer 2

A:

There seems to be a problem with HTML entities in your code, namely "Jos&eacute;" in the first block. At least my browser tells me there's a problem with it when I open the file, and XMLEntityScanner shows up in the stack trace. I'm not an XML expert, but could it be that HTML entities are not in fact defined for XML in general?

Edit: Yup, that's it. According to Wikipedia, entities like &eacute; are defined in the HTML DTD; XML has only a very small number of predefined entities (&amp;lt;, &amp;gt;, &amp;amp;, &amp;apos; and &amp;quot;).
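To make the point concrete, here is a small sketch using the JDK's default SAX parser. The class `EntityCheck` and its method `parses` are invented for this illustration. It is expected to reject a reference to an entity that no DTD declares, and to accept the same reference once the document declares it, which is what a DTD such as dblp.dtd would do for the real file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class EntityCheck {
    // Returns true if the default SAX parser accepts the document as well-formed.
    static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // An undeclared entity reference is reported as a fatal parse error.
            return false;
        }
    }
}
```

Calling `EntityCheck.parses("<a>Jos&eacute;</a>")` should fail, while prefixing the document with `<!DOCTYPE a [<!ENTITY eacute "&#233;">]>` should make it pass.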

Michael Borgwardt 2009-02-03 21:22:23

All entities are defined in dblp.dtd

2009-02-03 21:34:48

But would that cause a memory error? I'm not an XML expert either, but I would think that bad entities like &eacute; would cause SAXExceptions as opposed to memory exceptions.

Michael Angstadt 2009-02-03 21:42:44

That answer is of no benefit whatsoever to this question ...

mark 2009-02-03 22:15:33

Answer 3

A:

I don't know the correct terminology for this, but how "deep" does your XML go? For example, the "author" tag in your example is 2 elements deep. If you have tags that are really really deep, maybe that's why you're having memory issues?

Michael Angstadt 2009-02-03 21:43:56

the deepest level is 2

2009-02-03 21:54:04

Nesting really should not matter: the amount of memory used per level is very small for both SAX and StAX. I mean, not unless it's tens of thousands of levels or so. :)

StaxMan 2009-03-31 18:23:07

Answer 4

+2 A:

Well, given:

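import java.io.File;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {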

    /**
     * @param args the command line arguments
     */
    public static void main(String argv[]) {
        Writer out;

        // Use an instance of ourselves as the SAX event handler
        Echo handler = new Echo();
        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set up output stream
            out = new OutputStreamWriter(System.out, "UTF8");
            // Parse the input 
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(new File("/tmp/dblp.xml"), handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.out.println("Incollections = " + handler.cnt);
        System.exit(0);
    }

    static class Echo extends DefaultHandler {
        public int cnt = 0;
        @Override
        public void startElement(String namespaceURI,
                String sName, // simple name
                String qName, // qualified name
                Attributes attrs)
                throws SAXException {
            if (qName.equals("incollection")) {
                cnt = cnt + 1;
            }
        }
    }
}

This works for me under Java 5, but I do get the OOM under Java 6.

I run it like this:

java -DentityExpansionLimit=512000 -jar xmltest.jar

And it prints:

Incollections = 8353

Which is convenient:

grep "<incollection" /tmp/dblp.xml | wc -l
8353

So, FYI, data point, etc.
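If passing the flag on the command line is inconvenient, the same system property can presumably be set from code before the parser is created. This is only a sketch: the class `EntityLimitExample` and the method `newParserWithLimit` are made up here, and it assumes the JDK reads `entityExpansionLimit` when the factory and parser are instantiated, so the property has to be set first.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of any library.
class EntityLimitExample {
    // Sets the entity expansion limit system property, then creates a SAX parser.
    // Assumption: parsers created after this call pick up the new limit.
    static SAXParser newParserWithLimit(int limit) throws Exception {
        System.setProperty("entityExpansionLimit", Integer.toString(limit));
        return SAXParserFactory.newInstance().newSAXParser();
    }
}
```

With this in place, `EntityLimitExample.newParserWithLimit(512000)` would stand in for `factory.newSAXParser()` in the program above. Note that the property is JVM-wide, so it affects every parser created later in the same process.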

Will Hartung 2009-02-03 22:43:22

Thanks a lot… That was the problem. I should compile with Java 5 and extend the entity limit: java -DentityExpansionLimit=512000 Main

2009-02-04 11:35:32

Answer 5

A:

It sounds like one of the text segments (or a CDATA section, processing instruction, or comment) in the XML file is very long, and the parser does not split it into multiple segments. Or it could be that the parser fails to parse the DOCTYPE declaration properly: if so, it might try reading all the XML content as if it were part of the DTD subset.

But that's just speculation. You mentioned that you have tried StAX: which implementation? JDK 1.6 comes with Sun's Sjsxp. But you could also try Woodstox (http://woodstox.codehaus.org), which often handles things in a bit more robust way. So if you are not using Woodstox, you could see what happens with it. It does split text segments into smaller chunks unless you force text coalescing (which is not the default).

Oh, and just in case you were testing with the StAX reference implementation (http://stax.codehaus.org): it is unfortunately known to be very buggy, so that could cause problems. Both Sjsxp and Woodstox are much better choices for StAX.
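To show what non-coalescing text handling looks like in practice, here is a sketch using only the standard StAX API, so it runs against whichever implementation is on the classpath (Sjsxp by default, or Woodstox if its jar is present). The class `ChunkedText` and method `totalTextLength` are invented for the example. With `IS_COALESCING` turned off, the parser is allowed to deliver a long text segment as several events, and the code handles each chunk without ever building one large string.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class ChunkedText {
    // Sums the length of all character data, one chunk at a time.
    static long totalTextLength(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Let the parser report text in pieces instead of merging adjacent segments.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        long total = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    total += reader.getTextLength(); // length of this chunk only
                }
            }
        } finally {
            reader.close();
        }
        return total;
    }
}
```

How large each chunk is depends on the implementation; the sketch only relies on the chunks adding up to the full text.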

StaxMan 2009-03-31 18:27:20
