Hi there,

Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?

I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
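To make the retrieval half of that plan concrete, here is a minimal sketch of reading one element back out once its start and end byte offsets are known. The class name `ByteRangeReader`, the sample document, and the offsets 5 and 13 are made up for illustration; only `RandomAccessFile` itself comes from the plan above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteRangeReader {
    // Reads bytes [start, end) from the file and decodes them as UTF-8.
    static String readRange(Path file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    // Hypothetical demo: write a small document, then pull one element out by offsets.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("sample", ".xml");
            try {
                byte[] xml = "<a>\u00e9<b>x</b></a>".getBytes(StandardCharsets.UTF_8);
                Files.write(tmp, xml);
                // The accented character takes two bytes in UTF-8,
                // so <b> starts at byte 5 rather than 4.
                return readRange(tmp, 5, 13);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This only works if the stored offsets really are byte offsets; a character offset would land in the wrong place as soon as the file contains a multi-byte character.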

XMLStreamReader doesn't seem to have a way to track offsets. Instead, people recommend attaching the XMLStreamReader to an input stream that tracks how many bytes have been read (the CountingInputStream provided by Apache Commons IO, for example).

e.g.:

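import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

// Wrap the file stream so we can ask how many bytes have been consumed so far
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");

while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();

    switch (eventCode) {
        case XMLStreamConstants.END_ELEMENT:
            // Prints the number of bytes pulled from the file at the time of the event
            System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount());
            break;
    }
}
xmlStreamReader.close();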

Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?

A: 

You could wrap the actual input stream in a wrapper input stream that simply defers to the wrapped stream for the actual I/O operations but keeps an internal count, with a method for retrieving the current offset.
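A rough sketch of such a wrapper, in case it helps. The class name `OffsetTrackingInputStream` and the method `getOffset()` are invented for this example, and it ignores `mark`/`reset`. Note that the count only reflects what the consumer has pulled from the stream, so a consumer that reads in blocks will advance it in blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class OffsetTrackingInputStream extends FilterInputStream {
    private long count = 0;

    OffsetTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;          // one byte consumed
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;         // a whole block consumed at once
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    // Total number of bytes handed to the consumer so far.
    long getOffset() {
        return count;
    }

    // Hypothetical demo: one single-byte read, then one 4-byte block read.
    static long demo() {
        try (OffsetTrackingInputStream in =
                 new OffsetTrackingInputStream(new ByteArrayInputStream(new byte[10]))) {
            in.read();
            in.read(new byte[4]);
            return in.getOffset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```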

I didn't follow that exactly, but the CountingInputStream is just a wrapper around an unbuffered InputStream that keeps a count of how many bytes have been read. It sounds like I'm already doing what you suggest? The problem is that the XMLStreamReader seems to be reading ahead and buffering somewhat. Say there is an end tag 500 bytes into the file. The XMLStreamReader might fire the END_ELEMENT event after reading 500 bytes, or 501, or 600, whatever.
Dave
Yes, you are right, I wasn't thinking properly. Anyway, there is a getLocation() method in the interface which may work for you (if the parser supports proper Location objects): http://java.sun.com/webservices/docs/1.5/api/javax/xml/stream/XMLStreamReader.html#getLocation() In turn, Location offers getCharacterOffset().
A: 

I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.

        switch (eventCode) {
        case XMLStreamReader.END_ELEMENT :
            // getCharacterOffset() is reported by the parser's Location, not by the counting stream
            System.out.println(xmlStreamReader.getLocalName() + " end@" + xmlStreamReader.getLocation().getCharacterOffset()) ;
            break;
        }

This solution has the advantage of not needing an external JAR file, but the actual start position of each end tag would still have to be calculated manually.

I was not able to track down some minor inconsistencies in the data management (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.

Hope this helps!

mlschechter
Hmm, this comes close, except that it returns character offsets rather than byte offsets. My data is encoded in UTF-8 (a variable-length encoding), so there is no clean way to get byte offsets from this. >.<
Dave
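To illustrate the mismatch described in that comment, here is a small sketch showing how a character offset and a byte offset drift apart once the text contains non-ASCII characters. The class `Utf8Offsets` and its method are made up for the example, and it assumes the character offset counts Java `char`s (UTF-16 code units), which a given parser may or may not do. Converting this way also needs all of the decoded text up to the offset, which is why it is not a clean fix for a large file.

```java
import java.nio.charset.StandardCharsets;

class Utf8Offsets {
    // Number of UTF-8 bytes occupied by the first charOffset chars of text.
    // Assumes charOffset does not fall in the middle of a surrogate pair.
    static int byteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+00E9 takes 2 bytes and U+20AC takes 3 bytes in UTF-8.
        String xml = "<a>\u00e9\u20ac</a>";
        System.out.println(byteOffset(xml, 3)); // prints 3: only ASCII so far
        System.out.println(byteOffset(xml, 5)); // prints 8: 3 + 2 + 3
    }
}
```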
A: 

You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is neither reliable nor precise. It also looks like it gives the end point of the tag, not the starting location.

I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.

This comes close, and I could handle getting the end of the tag rather than the start. The big problem is that it returns character offsets rather than byte offsets. My data is encoded in UTF-8 (a variable-length encoding), so there is no clean way to get byte offsets from this. >.<
Dave
The documentation for Location.getCharacterOffset() says: "If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset." So maybe it is possible to construct and chain the streams so that it actually gives you a byte offset. I'm thinking of something like feeding a file input stream into a byte array stream into the XML event stream?
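One way to test that idea might look like the sketch below: hand the parser raw bytes rather than a Reader, then print whatever getCharacterOffset() reports at each end tag and compare it by hand against the known byte positions. The class `LocationProbe` and the sample document are invented for the example. Whether the reported numbers are byte offsets, character offsets, or something affected by buffering depends on the StAX implementation, so this is only a probe, not a guarantee.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class LocationProbe {
    // Collects "name@offset" for each end tag, using whatever offset the parser reports.
    static List<String> endTagOffsets(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Pass a byte stream (not a Reader) so the parser does the decoding itself.
        XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        List<String> out = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                    out.add(reader.getLocalName() + "@"
                            + reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }

    // Hypothetical demo document containing one two-byte UTF-8 character.
    static List<String> demo() {
        try {
            return endTagOffsets("<a>\u00e9<b>x</b></a>".getBytes(StandardCharsets.UTF_8));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In the demo bytes, </b> ends at byte 13 and </a> ends at byte 17.
        demo().forEach(System.out::println);
    }
}
```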
That is starting to sound very dodgy, but good luck! I've officially given up and will simply store the content of the XML file in an indexed database. (The idea was to have accurate byte offsets so that I could use a RandomAccessFile to read the parts of the XML file I needed, but that no longer seems like a good idea.)
Dave