Hi All,

I am trying to split a large XML file into smaller files using Java's SAXParser (specifically the Wikipedia dump, which is about 28GB uncompressed).

I have a PageHandler class which extends DefaultHandler:

private class PageHandler extends DefaultHandler {

    private StringBuffer text;
    ...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.append("<" + qName + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</" + qName + ">");

        if (qName.equals("page")) {
            text.append("\n");
            pageCount++;
            writePage();
        }

        if (pageCount >= maxPages) {
            rollFile();
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Append the reported slice of the buffer in one call.
        text.append(chars, start, length);
    }
}

So I can write out element content no problem. My problem is how to get the element tags and attributes: these characters do not seem to be reported. At best I will have to reconstruct them from the arguments passed to startElement, which seems a bit of a pain. Or is there an easier way?

All I want to do is loop through the file and write it out, rolling the output file every so often. How hard can this be? :)

Thanks

+1  A: 

I'm not quite sure I fully understand what you are trying to do, but qName is already a String, so you can use it directly. To get an attribute's name, call attributes.getQName(int index), and to get its value, call attributes.getValue(int index).
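
To make that concrete, here is a minimal sketch of rebuilding a start tag from the arguments passed to startElement. The class name StartTagSketch and the helper startTag are made up for illustration, and the attribute values are written out unescaped, which is an assumption that only holds if they contain no &, < or " characters.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of the original PageHandler.
class StartTagSketch {

    // Rebuilds <qName name="value" ...> from what SAX reports to startElement.
    static String startTag(String qName, Attributes attributes) {
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < attributes.getLength(); i++) {
            tag.append(' ')
               .append(attributes.getQName(i))
               .append("=\"")
               .append(attributes.getValue(i)) // assumption: value needs no escaping
               .append('"');
        }
        return tag.append('>').toString();
    }

    public static void main(String[] args) {
        // AttributesImpl stands in for the Attributes object the parser would pass.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "key", "key", "CDATA", "0");
        System.out.println(startTag("namespace", atts));
    }
}
```

Inside the handler, text.append(startTag(qName, attributes)) would then take the place of the hand-built "<" + qName + ">" string.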

Octavian Damiean
Thanks for this. Now my problem is that elements contain XML character references which are being decoded by the parser, so I'm writing out ">" as opposed to "&gt;". Any idea how to work around this?
Richard
@Richard: if you use dom4j, as I suggested in my answer, it will automatically encode these special characters for you. It's another benefit of using a library instead of writing XML documents out yourself.
Richard Fearn
@Richard - yes, agreed. Thanks for this and for your answer to my other question. I'm trying to echo directly without decoding then re-encoding, if possible.
Richard
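
As far as I know, a SAX handler only ever sees the decoded characters, so echoing the raw text is not something the characters callback can do on its own. If re-encoding is acceptable, a minimal sketch of escaping the text before appending it might look like this. The class XmlEscapeSketch and its escape method are hypothetical names, and it only handles the three characters that matter in element content.

```java
// Hypothetical helper, not part of the original PageHandler.
class XmlEscapeSketch {

    // Re-encodes the markup-significant characters in a slice of a char buffer,
    // matching the (chars, start, length) shape of the SAX characters callback.
    static String escape(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            char c = chars[i];
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] sample = "a > b & c".toCharArray();
        System.out.println(escape(sample, 0, sample.length));
    }
}
```

In the handler, characters would then call text.append(escape(chars, start, length)). Note that the output is equivalent XML rather than a byte-for-byte copy: a reference such as &quot; in the input would come back out as a plain quote character.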
A: 

The problem here is that you're writing the XML elements out yourself. Have a look at the XMLWriter class of dom4j - while it's a little old, it makes it really easy to output XML documents by calling its startElement and endElement methods.
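
If you'd rather not add a dependency, the same idea can be sketched with the JDK alone. Note this swaps in javax.xml.stream.XMLStreamWriter (StAX) for dom4j's XMLWriter, so it is an analogue of the suggestion above rather than the dom4j API itself. The class StreamWriterSketch and the method writeTitle are made-up names for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; uses StAX from the JDK, not dom4j.
class StreamWriterSketch {

    // Writes <title>text</title>, letting the writer escape special characters.
    static String writeTitle(String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("title");
            writer.writeCharacters(text); // the writer escapes & and < here
            writer.writeEndElement();
            writer.flush();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("failed to write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeTitle("AT&T"));
    }
}
```

In a SAX handler, startElement, characters and endElement would each forward to the matching write call, with the writer wrapped around the current output file instead of a StringWriter.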

Richard Fearn