I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?

+1  A: 

Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.

public void startElement (String uri, String localName,
                          String qName, Attributes attributes)
                          throws SAXException {
    // Count each opening <article> tag; constant-first equals() avoids a null check
    if ("article".equals(qName)) {
        counter++;
    }
    ...
}

(I don't know whether you are parsing "article", it's just an example)

If you don't know the number of articles in advance, you will need to count them first. Then you can print the status (tags read / total tags), say every 100 tags (counter % 100 == 0).

Or even have another thread monitor the progress. In that case you might want to synchronize access to the counter, though that isn't strictly necessary, given that the value doesn't need to be really accurate.
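
To make the counter idea concrete, here is a minimal sketch. The class name CountingHandler, the "article" element name and the tiny inline document are placeholders of mine, and the total is assumed to be known up front. The counter is an AtomicLong so that a separate monitoring thread could read it safely.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <article> elements and reports progress.
public class CountingHandler extends DefaultHandler {
    // AtomicLong so another thread can poll getCount() while parsing runs.
    private final AtomicLong counter = new AtomicLong();
    // Assumption: the total number of articles is known in advance.
    private final long total;

    public CountingHandler(long total) {
        this.total = total;
    }

    public long getCount() {
        return counter.get();
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if ("article".equals(qName)) {
            long n = counter.incrementAndGet();
            // Report only every 100 articles to keep the output cheap.
            if (n % 100 == 0) {
                System.out.printf("%d/%d articles (%.1f%%)%n",
                                  n, total, 100.0 * n / total);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in document; a real run would open the dump file instead.
        String xml = "<wiki><article/><article/><article/></wiki>";
        CountingHandler handler = new CountingHandler(3);
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        System.out.println("articles seen: " + handler.getCount());
    }
}
```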

My 2 cents

ewernli
I figured that out, but I was looking for a way to do it without needing to count the articles first. I thought maybe there was a way to figure out the parser's position in the file instead, because I can easily get the file size.
Danijel
+1  A: 

You can get an estimate of the current line/column in your file by overriding the setDocumentLocator method of org.xml.sax.helpers.DefaultHandler (or of the older org.xml.sax.HandlerBase). The parser calls this method with a Locator object, from which you can get an approximation of the current line/column whenever you need it.

Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.
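
A minimal sketch of the locator approach, for illustration only. The class name LocatorHandler and the inline document are made up, and since a parser is not required to supply a locator, the code checks for null.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the line number of the latest start tag.
public class LocatorHandler extends DefaultHandler {
    // May stay null: SAX parsers are encouraged, not required, to supply one.
    private Locator locator;
    private int lastLine = -1;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before any other event in the document.
        this.locator = locator;
    }

    public int getLastLine() {
        return lastLine;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (locator != null) {
            // Approximate position; -1 means the parser does not know it.
            lastLine = locator.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        // Small multi-line stand-in for the real file.
        String xml = "<wiki>\n<article/>\n<article/>\n</wiki>";
        LocatorHandler handler = new LocatorHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
        System.out.println("last start tag seen near line "
                           + handler.getLastLine());
    }
}
```

Turning the line number into a percentage still requires the total number of lines in the file.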

Po' Lazarus
Close, but then I would have to know the number of lines in the file, right?
Danijel
Indeed. Another idea might be the one pointed out by the enigmatic EJP: you can estimate the progress from how far the input stream has advanced. However, this is not exactly the parsing progress either, because of potential buffering and lookahead.
Po' Lazarus
+3  A: 

javax.swing.ProgressMonitorInputStream
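
ProgressMonitorInputStream is a Swing class and reports progress through a dialog. If no GUI is wanted, the same idea can be sketched with a hand-rolled byte-counting wrapper. The CountingInputStream below is a hypothetical class of mine, not part of the JDK; the parser would read through it, and the byte count divided by the file size gives a rough percentage.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: counts the bytes consumed from the underlying stream.
public class CountingInputStream extends FilterInputStream {
    // volatile so a monitoring thread sees updates; only the reader writes it.
    private volatile long count;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    public long getCount() {
        return count;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000]; // stand-in for the real file contents
        long total = data.length;     // for a file: new File(path).length()
        try (CountingInputStream in =
                 new CountingInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                System.out.printf("%.0f%% read%n",
                                  100.0 * in.getCount() / total);
            }
        }
    }
}
```

In the real program the wrapped stream would be handed to SAXParser.parse(InputStream, DefaultHandler) and getCount() polled from the handler or from another thread. Because the parser buffers its input, the number runs slightly ahead of what has actually been parsed.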

EJP
I think this will be close enough. Thanks!
Danijel