I need to process a huge XML file (about 4 GB). I am using dom4j with a SAX parser, but I wrote my own DefaultElementHandler. The code skeleton is below:

SAXParserFactory sf = SAXParserFactory.newInstance();
SAXParser sax = sf.newSAXParser();
sax.parse("english.xml", new DefaultElementHandler("page") {
    public void processElement(Element element) {
        // process the element
    }
});

I thought I was processing the huge file "page" by "page", but apparently not, because I always get an out-of-memory error. Did I miss anything important? Thanks. I am new to XML processing.

A: 

You don't really process XML page by page. However, if you extend XMLFilterImpl instead of using the DefaultElementHandler (whatever that is), you can simply process the XML elements as they arrive. Because you are streaming, the entire document is never in memory at once (as a practical matter).

You will essentially get called for each element: at the start of the element (which also delivers its attributes), for the text within it, and then at the end of the element (look at the methods in the ContentHandler interface). You do your processing based on these calls; you will probably need some data structures in which you accumulate the elements inside your "page" element. Also note that there is no guarantee you will get only one call for the text (it's up to the parser).
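To make the callbacks above concrete, here is a minimal sketch, assuming each "page" element has a "title" child (that structure, and the class name PageTitleFilter, are assumptions for illustration, not something taken from the question). It extends XMLFilterImpl, appends text in characters() because the parser may deliver one text node in several calls, and keeps only one title in memory at a time.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example: streams the document and remembers only the text of
// the <title> currently being read, so memory use does not grow with file size.
public class PageTitleFilter extends XMLFilterImpl {
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle = false;
    private int pages = 0;
    private String lastTitle = null;

    public PageTitleFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("title")) {
            inTitle = true;
            text.setLength(0); // start a fresh buffer for this title
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            inTitle = false;
            lastTitle = text.toString(); // the complete title is available here
        } else if (qName.equals("page")) {
            pages++; // one full page has been seen and can be forgotten
        }
    }

    public int getPages() {
        return pages;
    }

    public String getLastTitle() {
        return lastTitle;
    }

    // Convenience helper for trying the filter on a small in-memory document.
    public static PageTitleFilter run(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        PageTitleFilter filter = new PageTitleFilter(reader);
        filter.parse(new InputSource(new StringReader(xml)));
        return filter;
    }
}
```

For a real 4 GB file you would pass an InputSource built from a file stream rather than a string, and do your per-page work in endElement instead of just counting.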

Does this help make it clearer?

Francis Upton
I do not really get your point... Would you please explain in a bit more detail?
jason.Z
Done, hopefully it's more clear now.
Francis Upton
So am I doing it right or wrong if I want to use the extended element handler mentioned above? Why do I always get an OOM?
jason.Z
Make sure your sBuilder is actually getting reset; you can use a profiler to see what's happening to your memory, or if you don't have that, put some debugging code that shows the sBuilder is behaving how you would expect.
Francis Upton
Check what you are actually giving to the DOM (in your DocumentHelper.parseText() call) as well.
Francis Upton
A: 

I think it only reads the content within one element; I just followed an example I found online:

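import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public abstract class DefaultElementHandler extends DefaultHandler {
 private boolean begin;
 private String tagName;
 private StringBuilder sBuilder;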

public DefaultElementHandler(String tagName) {
 this.tagName = tagName;
 this.begin = false;
 this.sBuilder = new StringBuilder();
}

public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
 if (qName.equals(tagName)||begin){
  sBuilder.append("<");
  sBuilder.append(qName);
  sBuilder.append(" ");
  int attrCount = attributes.getLength();
  for (int i=0; i<attrCount; i++) {
   sBuilder.append(attributes.getQName(i));
   sBuilder.append("=\"");
   sBuilder.append(attributes.getValue(i));
   sBuilder.append("\" ");
  }
  sBuilder.append(">");
  begin = true;
 }
}

public void characters(char[] ch, int start, int length) throws SAXException {
 // Ignore text outside the target element.
 if (!begin) return;
 // convertSpecialChar is a helper that is not shown in this post.
 StringBuilder sb = new StringBuilder();
 for (int i = 0; i < length; i++) {
  sb.append(convertSpecialChar(ch[start + i]));
 }
 String text = sb.toString().trim();
 if (text.isEmpty()) return;
 sBuilder.append(text);
}

public void endElement(String uri, String localName, String qName) throws SAXException {
 String stag = "</" + tagName + ">";
 String ntag = "</" + qName + ">";
 if (stag.equals(ntag) || begin) {
  sBuilder.append(ntag);
  if (stag.equals(ntag)) {
   begin = false;
   try {
    // Parse only the buffered fragment, not the whole file.
    Document doc = DocumentHelper.parseText(sBuilder.toString());
    Element element = doc.getRootElement();
    this.processElement(element);
   } catch (DocumentException e) {
    e.printStackTrace();
    System.exit(1);
   }
   // Reset the buffer so the next target element starts empty.
   sBuilder.setLength(0);
  }
 }
}

// Implemented by the caller to handle one complete target element.
public abstract void processElement(Element element);
}
jason.Z
A: 

Your DefaultElementHandler implementation looks confused to me. It looks like everything is piling up in sBuilder, which never gets cleared until the parser reaches the end of the root element or, more likely, runs out of memory.

How you read the element text depends on what kind of XML you need to parse. Each element can have text, and that text can be interspersed with child elements. Generally, there is the kind of XML you see in web services and config files, where all the element text is in the leaf elements, and then there are cases like XHTML, where text and child elements are mixed together.

If your XML's schema puts all the text in the leaf elements, then you can start buffering text at startElement, use the accumulated text in endElement, and then clear the buffer.
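A rough sketch of that leaf-element buffering pattern follows. The class name LeafTextHandler and the "name=value" output format are made up for the example, and collecting results in a list only makes sense for small inputs; for a huge file you would process each value in endElement and drop it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: buffers text per leaf element and clears the buffer
// at every element boundary, so it never holds more than one leaf's text.
public class LeafTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> leaves = new ArrayList<>();
    private boolean leaf = false; // true until a child element has been closed

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // discard any text seen before this element
        leaf = true;         // assume a leaf until proven otherwise
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may split one text node across several calls.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (leaf) {
            leaves.add(qName + "=" + buffer.toString().trim());
        }
        buffer.setLength(0); // clear here, not at the end of the document
        leaf = false;        // the enclosing element has a child, so it is not a leaf
    }

    public List<String> getLeaves() {
        return leaves;
    }

    // Convenience helper for trying the handler on a small in-memory document.
    public static LeafTextHandler parse(String xml) throws Exception {
        LeafTextHandler handler = new LeafTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }
}
```

The important part is that the buffer is reset at every element boundary, so its size is bounded by the largest single leaf rather than by the document.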

Here's a good article on SAX: http://www.javaworld.com/javaworld/jw-08-2000/jw-0804-sax.html

Nathan Hughes
Where is a proper place to clear the sBuilder? inside the elementHandler or outside it?
jason.Z
It would be inside the element handler. You just can't wait for the end of the document to clear it out.
Nathan Hughes
I found that it successfully processes around 1000 pages and then gets an OOM... How do I clear sBuilder? sBuilder.setLength(0), or set it to null?
jason.Z
setLength(0) will work.
Nathan Hughes
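On the setLength(0)-versus-null question above, here is a small sketch of what setLength(0) does to a StringBuilder (BufferResetDemo is a made-up name). It empties the contents but keeps the already-allocated backing array for reuse; trimToSize() is the standard call that asks the builder to release unused capacity, should one unusually large page have inflated the buffer.

```java
public class BufferResetDemo {
    // Returns {length after reset, capacity after reset, capacity after trim}.
    public static int[] demo() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append('x'); // simulate one large buffered page
        }

        sb.setLength(0); // logically empty, ready for the next page
        int lengthAfterReset = sb.length();
        int capacityAfterReset = sb.capacity(); // backing array is still large

        sb.trimToSize(); // asks the builder to shrink its storage to fit
        int capacityAfterTrim = sb.capacity();

        return new int[] {lengthAfterReset, capacityAfterReset, capacityAfterTrim};
    }
}
```

Keeping the capacity is normally what you want when the same buffer is reused for every page; it only matters for memory if a single page is very large.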