Hi there,
Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead people recommend attaching the XmlStreamReader to a reader that tracks how many bytes have been read (the CountingInputStream provided by apache.commons.io, for example)
e.g:
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile)) ;
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8") ;
while (xmlStreamReader.hasNext()) {
int eventCode = xmlStreamReader.next();
switch (eventCode) {
case XMLStreamReader.END_ELEMENT :
System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount()) ;
}
}
xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?