I am working with potentially huge XML files containing complex trace information from one of my projects.
I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.
If I had a "shelve" index containing information such as "books for author Joe are at offsets [22322, 35446, 54545]", I could open the XML file like a regular text file, seek to those offsets, and hand the data to one of the DOM parsers that take a file or a string.
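To make the lookup side concrete, here is a rough sketch of what I have in mind. It uses ElementTree's XMLPullParser rather than a true DOM parser, because it can stop as soon as the element starting at the offset is closed. The function names load_element_at and load_indexed are made up for illustration, and the sketch assumes the file is UTF-8 and that the indexed elements do not depend on namespace prefixes or entities declared earlier in the document.

```python
import shelve
import xml.etree.ElementTree as ET


def load_element_at(path, offset, chunk_size=64 * 1024):
    """Parse the single element whose start tag begins at byte `offset`."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("element at offset %d is never closed" % offset)
            # Text after the element's end tag makes the parser report an
            # error, but XMLPullParser queues that error behind the events
            # it already produced, so we return before reaching it.
            parser.feed(chunk)
            for event, elem in parser.read_events():
                depth += 1 if event == "start" else -1
                if depth == 0:
                    return elem


def load_indexed(xml_path, index_path, key):
    """Look up `key` in the shelve index and parse every element it points to."""
    with shelve.open(index_path, flag="r") as index:
        offsets = index.get(key, [])
    return [load_element_at(xml_path, offset) for offset in offsets]
```

With an index like the one described above, load_indexed("trace.xml", "trace.idx", "Joe") would return only the matching elements, reading just the parts of the file that contain them (both file names are placeholders).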
The part that I have not figured out yet is how to quickly parse the XML and create such an index.
So what I need is a fast SAX parser that reports the start offset of each tag in the file along with its start event. Then I can parse a subsection of the XML, knowing its starting point in the document, extract the key information, and store the key and offset in the shelve index.
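One candidate that seems to fit is the standard library's xml.parsers.expat module: during a callback, the parser's CurrentByteIndex attribute gives the byte position of the first character of the markup that triggered the event, which for a start tag is its "<". The sketch below is only an assumption about how the indexing pass might look. The names build_offset_index and save_index are made up, and it assumes the key is an attribute on the start tag (such as author="Joe") rather than something nested deeper in the element.

```python
import shelve
import xml.parsers.expat


def build_offset_index(xml_path, tag, key_attr):
    """Map each value of `key_attr` on <tag> elements to a list of byte offsets."""
    index = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Inside a handler, CurrentByteIndex is the byte offset of the
        # '<' that opens the start tag being reported.
        if name == tag and key_attr in attrs:
            index.setdefault(attrs[key_attr], []).append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    with open(xml_path, "rb") as f:
        # ParseFile reads the file in chunks, so it is never fully in memory.
        parser.ParseFile(f)
    return index


def save_index(index, index_path):
    """Persist the key -> offsets mapping in a shelve file."""
    with shelve.open(index_path) as shelf:
        shelf.update(index)
```

The offsets are byte offsets, so the XML file has to be opened in binary mode when seeking to them later.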