I'm writing a SAX parser in Java to parse a 2.5 GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?
Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.
public void startElement(String uri, String localName,
                         String qName, Attributes attributes)
        throws SAXException {
    if (qName.equals("article")) {
        counter++; // one more article seen
    }
    ...
}
(I don't know whether your element is actually called "article"; it's just an example.)
If you don't know the number of articles in advance, you will need to count them first. Then you can print the status (tags read / total number of tags), say every 100 tags (counter % 100 == 0).
Or even have another thread monitor the progress. In this case, you might want to synchronize access to the counter, but that isn't strictly necessary, given that the figure doesn't need to be really accurate.
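To make the idea concrete, here is a minimal sketch of such a handler. The class names ArticleCountingHandler and ArticleCountingDemo, the element name "article" and the tiny in-memory document are all assumptions for illustration, not something taken from your code. The counter is an AtomicLong so that a second thread could poll it safely if you go that route.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <article> start tags and reports every 100.
class ArticleCountingHandler extends DefaultHandler {
    // AtomicLong so that a monitoring thread can read the value safely.
    private final AtomicLong counter = new AtomicLong();
    // Known (or pre-counted) number of articles; an assumption of this sketch.
    private final long total;

    ArticleCountingHandler(long total) {
        this.total = total;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (qName.equals("article")) {
            long n = counter.incrementAndGet();
            if (n % 100 == 0) {
                System.out.println(n + "/" + total + " articles read");
            }
        }
    }

    // Can be called from another thread to poll the progress.
    long articlesRead() {
        return counter.get();
    }
}

class ArticleCountingDemo {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in document; the real input would be the 2.5 GB file.
        String xml = "<wiki><article/><article/><article/></wiki>";
        ArticleCountingHandler handler = new ArticleCountingHandler(3);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.articlesRead() + " articles read in total");
    }
}
```

A monitoring thread would simply call articlesRead() every few seconds and print the ratio against the total.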
My 2 cents
You can get an estimate of the current line/column in your file by overriding the method setDocumentLocator of org.xml.sax.helpers.DefaultHandler (or of the deprecated org.xml.sax.HandlerBase). This method is called with a Locator object from which you can get an approximation of the current line/column when needed.
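A rough sketch of what that could look like follows. The class name LocatingHandler and the lastLine/lastColumn fields are made up for the example. Note that a parser is encouraged but not required to supply a locator, so the code guards against it being null, and the reported numbers are only approximate.

```java
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers where the parser currently is.
class LocatingHandler extends DefaultHandler {
    // Supplied by the parser before startDocument; stays null if the
    // parser does not provide a locator.
    private Locator locator;
    private int lastLine = -1;
    private int lastColumn = -1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (locator != null) {
            // Both getters return -1 when the position is unavailable.
            lastLine = locator.getLineNumber();
            lastColumn = locator.getColumnNumber();
        }
    }

    int lastLine() {
        return lastLine;
    }

    int lastColumn() {
        return lastColumn;
    }
}
```

If you know roughly how many lines the file has (for instance from a previous run), lastLine() divided by that number gives a usable progress estimate.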
Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.
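One workaround that does not depend on the SAX implementation at all is to count the bytes the parser pulls from the input stream. This is not a SAX feature, just a plain stream wrapper, and the class name CountingInputStream is my own. Because parsers read ahead in buffers, the count runs slightly ahead of the element actually being handled, so treat it as an estimate.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper that counts how many bytes have been consumed.
class CountingInputStream extends FilterInputStream {
    // volatile so a monitoring thread sees recent values; only the
    // parsing thread ever writes to it.
    private volatile long bytesRead;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped;
        return skipped;
    }

    // Disable mark/reset so the count never has to go backwards.
    @Override
    public boolean markSupported() {
        return false;
    }

    long bytesRead() {
        return bytesRead;
    }
}
```

You would wrap the FileInputStream in this class, hand the wrapper to SAXParser.parse(InputStream, DefaultHandler), and compare bytesRead() with File.length() to get a percentage.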