tags:

views:

101

answers:

7

I have huge XML files, up to 1-2 GB, and obviously I can't parse a whole file at once; I'd have to split it into parts, then parse the parts and do whatever with them.

How can I count the number of occurrences of a certain node, so I can keep track of how many parts I need to split the file into? Is there maybe a better way to do this? I'm open to all suggestions, thank you.

Question update:

Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.

Here is the exception I get:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at com.sun.org.apache.xerces.internal.util.NamespaceSupport.<init>(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.setNamespaceContext(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(Unknown Source)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.bridge(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.parse(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)

Actually I think my whole approach is wrong. What I'm actually trying to do is convert XML files into CSV samples. Here is how I do it so far:

  • Read/parse the XML file
  • For each element node, get the text node value
  • Open a stream and write it to a temp file; after n nodes, flush and close the stream
  • Then open another stream reading from the temp file, use the Commons strip utils and some other stuff to create proper CSV output, then write it to the CSV file
A: 

You'd be better off using an event-based parser such as SAX.

spender
+3  A: 

The SAX or StAX APIs would be your best bet here. They don't parse the whole thing at once; they take one node at a time and let your app process it. They're good for arbitrarily large documents.

SAX is the older API and works on a push model; StAX is newer and is a pull parser, and is therefore rather easier to use. For your requirements, either one would be fine.

See this tutorial to get you started with StAX parsing.
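
If you go the cursor-based StAX route, the counting part can be as small as something like this. It's just a sketch: the file name and the element name "record" stand in for whatever you actually want to count.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxElementCounter {
        public static void main(String[] args) throws Exception {
            // The cursor API keeps only the current event in memory,
            // so the size of the file doesn't matter.
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("huge.xml"));

            long count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            System.out.println("record count: " + count);
        }
    }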

skaffman
+1 for mentioning that StAX (pull) is easier to use than SAX.
naikus
+1  A: 

I think you want to avoid creating a DOM, so SAX or StAX should be good choices.

With SAX, just implement a simple ContentHandler that increments a counter whenever an interesting element is found.
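
A rough sketch of such a handler, assuming the element you want to count is called "record" (adjust the name and the file path to your data):

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class CountingHandler extends DefaultHandler {
        private long count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            // Count every start tag of the element we care about.
            if ("record".equals(qName)) {
                count++;
            }
        }

        public static void main(String[] args) throws Exception {
            CountingHandler handler = new CountingHandler();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new java.io.File("huge.xml"), handler);
            System.out.println("record count: " + handler.count);
        }
    }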

Andreas_D
+2  A: 

You can use a streaming parser like StAX for this. It will not require you to read the entire file into memory at once.

Gerco Dries
+1  A: 

With SAX you don't have to split the file: It's streaming, so it holds only the current bits in memory. It's very easy to write a ContentHandler that just does the counting. And it's very fast (in my experience, almost as fast as simply reading the file).

Chris Lercher
A: 

I think splitting the file is not the way to go. You'd be better off handling the XML file as a stream and using the SAX API (and not the DOM API).

Even better, you should use XQuery to handle your requests.

Saxon is a good Java/.NET implementation (using SAX) that is amazingly fast, even on big files. The HE version is under the MPL open-source license.

Here is a little example:

java -cp saxon9he.jar net.sf.saxon.Query -qs:"count(doc('/path/to/your/doc/doc.xml')//YouTagToCount)"
alci
A: 

Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.

By this description, I'd say yes, the logic you're using it for is wrong. You're holding on to too much in memory.

Rather than parsing the entire file, storing all the node values into something and then processing the result, you should handle each node as you hit it, and output while parsing.

With more details on what you're actually trying to accomplish and what the input XML and desired output look like, we could probably help streamline it.
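
Just to illustrate "output while parsing", here's an untested sketch that writes one CSV row per record as it goes, instead of buffering everything. It assumes a flat structure of "record" elements whose children are text-only, and it skips proper CSV escaping (which you could still do with the Commons utilities you mentioned).

    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class XmlToCsv {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("huge.xml"));
            BufferedWriter out = new BufferedWriter(new FileWriter("out.csv"));

            StringBuilder row = new StringBuilder();
            boolean inRecord = false;

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("record".equals(reader.getLocalName())) {
                        inRecord = true;
                        row.setLength(0);                  // start a fresh row
                    } else if (inRecord) {
                        if (row.length() > 0) {
                            row.append(',');
                        }
                        // one column per text-only child element;
                        // real code would escape commas/quotes here
                        row.append(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    out.write(row.toString());             // write the row immediately
                    out.newLine();
                    inRecord = false;
                }
            }
            out.close();
            reader.close();
        }
    }

The point is that nothing grows with the size of the input: the StringBuilder only ever holds one row, and the writer flushes to disk as it fills up.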

Don Roby