ansaurus

Question

Parsing big XML file with SAX parser, the class is becoming bloated and unreadable - How to fix this ?

Answer 1

A:

It depends on the XML structure. If the actions for different cases are easy or (more or less) "independent", you could try to use a map:

interface Command {
   public void assemble(Attributes attr, MyStructure myStructure);
}
...

Map<String, Command> commands= new HashMap<String, Command>();
...
if(commands.contains(qName)) {
   commands.get(qname).assemble(attr, myStructur);
} else {
   //unknown qName
}

Landei 2010-08-26 10:00:55

So if I follow this right, I have to implement a different assemble method for each qName ? This would indeed reduce my list of if/else statement to 6 lines, but I don't really know how you would go to have different actions for each command, any example ?

SAKIROGLU Koray 2010-08-26 12:32:06

It is hard to say if this style is actually better than the if-else-cascade. It really all depends on what you need to do in every case. If you build one big object or have to switch some setting, the Map style is probably better. If you have complicated interactions between the cases or need a lot of intermediate data structures, it gets probably too messy.

Landei 2010-08-26 12:47:26

Answer 2

A:

I would fallback to JAXB or something equivalent and let the framework do the work.

Claude Vedovini 2010-08-26 10:08:16

Can you be more specific ? I don't really know much about JAXB, Is it as efficient or more efficient than SAX ? How would this solve my problem ? If I understand right what I read about it, I would only have to create a schema for the mapping and then let JAXB do all the parsing ?

SAKIROGLU Koray 2010-08-27 12:05:37

use JAXB to create a Java DOM from the schema then use it to parse documents into DOM instances. Internally it uses SAX to parse so it should be about the same performance wise. However at the end of the parsing you'll get an in-memory object tree so depending of what you were actually doing before you might end up with more memory needed. And this might be a problem because the bigger the document the more memory you'll need.

Claude Vedovini 2010-08-31 20:49:00

Answer 3

+1 A:

What do you use the boolean variables for? To keep track of nesting?

I recently implemented this by using an enum for every element. The code is at work but this is a rough approximation of it off the top of my head:

enum Element {
   // special markers:
   ROOT,
   DONT_CARE,

   // Element               tag                  parents
   RootElement(             "root"               ROOT),
   AnElement(               "anelement"),     // DONT_CARE
   AnotherElement(          "anotherelement"),// DONT_CARE
   AChild(                  "child",             AnElement),
   AnotherChild(            "child",             AnotherElement);

   Element() {...}
   Element(String tag, Element ... parents) {...}
}

class MySaxParser extends DefaultHandler {
    Map<Pair<Element, String>, Element> elementMap = buildElementMap();
    LinkedList<Element> nestingStack = new LinkedList<Element>();

    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        Element parent = nestingStack.isEmpty() ? ROOT : nestingStack.lastElement();
        Element element = elementMap.get(pair(parent, sName));
        if (element == null)
            element = elementMap.get(DONT_CARE, sName);
        if (element == null)
            throw new IllegalStateException("I did not expect <" + sName + "> in this context");

        nestingStack.addLast(element);

        switch (element) {
        case RootElement: ... // Probably don't need cases for many elements at start unless we have attributes
        case AnElement: ...
        case AnotherElement: ...
        case AChild: ...
        case AnotherChild: ...
        default: // Most cases here. Generally nothing to do on startElement
        }
    }
    public void endElement(String namespaceURI, String sName, String qName) {
        // Similar to startElement() but most switch cases do something with the data.
        Element element = nestingStack.removeLast();
        if (!element.tag.equals(sName)) throw IllegalStateException();
        switch (element) {
           ...
        }
    }

    // Construct the structure map from the parent information.
    private Map<Pair<Element, String>, Element> buildElementMap() {
        Map<Pair<Element, String>, Element> result = new LinkedHashMap<Pair<Element, String>, Element>();
        for (Element element: Element.values()) {
            if (element.tag == null) continue;
            if (element.parents.length == 0)
                result.put(pair(DONT_CARE, element.tag), element);
            else for (Element parent: element.parents) {
                result.put(pair(parent, element.tag), element);
            }
        }
        return result;
    }
    // Convenience method to avoid the need for using "new Pair()" with verbose Type parameters 
    private <A,B> Pair<A,B> pair(A a, B b) {
        return new Pair<A, B>(a, b);
    }
    // A simple Pair class, just for completeness.  Better to use an existing implementation.
    private static class Pair<A,B> {
        final A a;
        final B b;
        Pair(A a, B b){ this.a = a; this.b = b;}
        public boolean equals(Object o) {...};
        public int hashCode() {...};
    }
}

Edit:
The position within the XML structure is tracked by a stack of elements. When startElement is called, the appropriate Element enum can be determined by using 1) the parent element from the tracking stack and 2) the element tag passed as the sName parameter as the key to a Map generated from the parent information defined as part of the Element enum. The Pair class is simply a holder for the 2-part key.

This approach allows the same element-tag that appears repeatedly in different parts of the XML structure with different semantics to be represented by different Element enums. For example:

<root>
  <anelement>
    <child>Data pertaining to child of anelement</child>
  </anelement>      
  <anotherelement>
    <child>Data pertaining to child of anotherelement</child>
  </anotherelement>
</root>

Using this technique, we don't need to use flags to track the context so that we know which <child> element is being processed. The context is declared as part of the Element enum definition and reduces confusion by eliminating assorted state variables.

Adrian Pronk 2010-08-26 11:03:27

I understand using a linked map for tracking my position in the XML file, but what is the purpose of the Pair ?

SAKIROGLU Koray 2010-08-30 08:30:16

@SAKIROGLU Koray: The Pair is simply a holder for a two-value Map key. I've edited my answer to clarify that.

Adrian Pronk 2010-08-30 20:33:21

ansaurus

tags:

views:

answers:

Parsing big XML file with SAX parser, the class is becoming bloated and unreadable - How to fix this ?

related questions