ansaurus

Question

How do I get the correct starting/ending locations of an XML tag with SAX?

Answer 1

+1 A:

What SAX parser are you using? Some, I am told, do not provide a Locator facility.

The output of the simple Python program below will give you the starting row and column number of every element in your XML file, e.g. if you indent two spaces in your XML:

Element: MyRootElem
starts at row 2 and column 0

Element: my_first_elem
starts at row 3 and column 2

Element: my_second_elem
starts at row 4 and column 4

Run like this: python sax_parser_filename.py my_xml_file.xml

#!/usr/bin/env python3

import sys
from xml.sax import ContentHandler, make_parser

class MySaxDocumentHandler(ContentHandler):
    """
    The document handler class serves to instantiate
    an event handler which acts on the various events
    coming from the parser.
    """
    def __init__(self):
        # ContentHandler.__init__ sets self._locator to None; a parser that
        # supports locators supplies one through setDocumentLocator().
        ContentHandler.__init__(self)

    def startElement(self, name, attrs):
        print("Element: %s" % name)
        print("starts at row %s and column %s\n"
              % (self._locator.getLineNumber(),
                 self._locator.getColumnNumber()))

    def endElement(self, name):
        pass

def mysaxparser(inFileName):
    # create a handler
    handler = MySaxDocumentHandler()
    # create a parser
    parser = make_parser()
    # associate our content handler to the parser
    parser.setContentHandler(handler)
    # open in binary mode so the parser can detect the encoding itself
    with open(inFileName, 'rb') as inFile:
        # start parser
        parser.parse(inFile)

def main():
    mysaxparser(sys.argv[1])

if __name__ == '__main__':
    main()

Adam Bernier 2009-07-03 06:09:24

I am using the Java one. I believe it has a working Locator. However, in startElement() it only reports the location of the last character of the tag.

Winston Chen 2009-07-03 06:30:29

And thank you for your code. It is cool.

Winston Chen 2009-07-03 06:36:21

Not a problem. I retagged the question to hopefully attract more interest from the Java-coding masses.

Adam Bernier 2009-07-03 07:05:09

Perhaps you'd also like to post a code sample to help folks diagnose the exact problem.

Adam Bernier 2009-07-03 07:05:54

Answer 2

+1 A:

Unfortunately, the Locator interface provided by the Java system library in the org.xml.sax package does not, by definition, give more detailed information about the document location. To quote from the documentation of the getColumnNumber method:

The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.

According to that specification, you will always get the position "of the first character after the text associated with the document event" based on best effort by the SAX driver. So the short answer to the first part of your question is: No, the Locator does not provide information about the start location of a tag. Also, if you are dealing with multi-byte characters in your documents, e.g., Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.

If you are after exact positions for tags, or want even more fine-grained information about attributes, attribute content etc., you'd have to implement your own location provider.

With all the potential encoding issues, Unicode characters etc. involved, I guess this is too big of a project to post here. The implementation will also depend on your specific requirements.

Just a quick warning from personal experience: writing a wrapper around the InputStream you pass into the SAX parser is dangerous, as you don't know how far ahead the parser has already read from the stream when it reports its events.
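
A minimal sketch of that read-ahead problem, using Python's xml.sax (as in Answer 1) rather than Java. CountingStream, ReadAheadHandler and bytes_read_at_each_element are made-up names for illustration. The stream counts the bytes handed to the parser, and the handler records that count whenever an element starts. With a small document the default expat-based parser pulls the whole input in one read before reporting the first element, so the counter says nothing about where an element sits. That Java parsers buffer in a comparable way is an assumption here, not something this sketch shows.

```python
import io
from xml.sax import ContentHandler, make_parser


class CountingStream(io.BytesIO):
    """Hypothetical wrapper that counts the bytes the parser has pulled."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


class ReadAheadHandler(ContentHandler):
    """Records the stream counter each time an element starts."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream
        self.seen = []  # (element name, bytes consumed so far)

    def startElement(self, name, attrs):
        self.seen.append((name, self.stream.bytes_read))


def bytes_read_at_each_element(data):
    """Parse data and return the byte counter seen at every start tag."""
    stream = CountingStream(data)
    handler = ReadAheadHandler(stream)
    parser = make_parser()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.seen


if __name__ == "__main__":
    doc = b"<root>\n  <first/>\n  <second/>\n</root>\n"
    for name, count in bytes_read_at_each_element(doc):
        print(name, "reported after", count, "of", len(doc), "bytes were read")
```

For this short document every element is reported only after the full input has been read, which is why a position derived from the wrapper alone would be wrong for all of them.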

You could start by doing some counting of your own in the characters(char[], int, int) method of your ContentHandler by checking for line breaks, tabs etc. in addition to using the Locator information, which should give you a better picture of where in the document you actually are. By remembering the positions of the last event you could calculate the start position of the current one. Take into account though, that you might not see all line breaks, as those could appear inside tags which you would not see in characters, but you could deduce those from the Locator information.
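
The counting described above could be sketched roughly as follows, in Python as in Answer 1. advance() is a hypothetical helper, not part of any SAX API: it takes a known (line, column) position plus the text delivered to characters() and returns the position reached after that text. It assumes 1-based lines and columns, one column per character (tabs included), and "\n" as the only line break, which is what a parser passes on for literal line breaks after line-end normalization.

```python
def advance(line, column, text):
    """Return the (line, column) reached after reading text.

    Assumptions (not taken from any SAX specification): lines and columns
    are 1-based, every character occupies one column, and "\n" is the
    only line break that appears in text.
    """
    for ch in text:
        if ch == "\n":
            # a line break moves to the first column of the next line
            line += 1
            column = 1
        else:
            column += 1
    return line, column


if __name__ == "__main__":
    # "abc", a line break, two spaces of indentation, then "de"
    print(advance(1, 1, "abc\n  de"))  # (2, 5)
```

Fed with the position remembered from the previous event, this gives an estimate of where the current chunk of text ends, that is, where the next tag may begin. Line breaks inside tags never reach characters(), so the result still has to be reconciled with what the Locator reports.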

Christian Hang 2009-07-05 02:39:35

It seems to be so. Too bad :( I think I will just do as you said, or download the source code and modify that part when I have some free time later. Thanks man!!

Winston Chen 2009-07-05 03:37:08

Answer 3

A:

Here is the solution I finally figured out. (I was too lazy to post it earlier, sorry.) The characters(), endElement() and ignorableWhitespace() methods are crucial: in each of them the locator points to a possible starting point of the next tag. The locator in characters() points to the end of the closest non-tag content; the locator in endElement() points to the end of the previous tag, which is the starting point of the current tag if the two are adjacent; and the locator in ignorableWhitespace() points to the end of a run of spaces and tabs. As long as we keep track of the ending position reported in these three methods, we can find the starting point of the current tag, and the locator in endElement() already gives us its ending position. That way, both the starting point and the ending point of an XML tag can be found.

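import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class Example extends DefaultHandler{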
private Locator locator;
private SourcePosition startElePoint = new SourcePosition();

public void setDocumentLocator(Locator locator) {
    this.locator = locator;
}
/**
* <a> <- the locator points to here
* <b>
* </a>
*/
public void startElement(String uri, String localName, 
 String qName, Attributes attributes) {

}
/**
* <a>
* <b>
* </a> <- the locator points to here
*/
public void endElement(String uri, String localName, String qName)  {
 /* here we can get our source position */
 SourcePosition tag_source_starting_position = this.startElePoint;
 SourcePosition tag_source_ending_position = 
  new SourcePosition(this.locator.getLineNumber(),
   this.locator.getColumnNumber());

 // do your things here

 //update the starting point for the next tag
 this.updateElePoint(this.locator);
}

/**
* some other words <- the locator points to here
* <a>
* <b>
* </a>
*/
public void characters(char[] ch, int start, int length) {
 this.updateElePoint(this.locator);//update the starting point
}
/**
*the locator points to here-> <a>
*        <b>
*         </a>
*/
public void ignorableWhitespace(char[] ch, int start, int length) {
 this.updateElePoint(this.locator);//update the starting point
}
private void updateElePoint(Locator lo){
 SourcePosition item = new SourcePosition(lo.getLineNumber(), lo.getColumnNumber());
 if(this.startElePoint.compareTo(item)<0){
  this.startElePoint = item;
 }
}

class SourcePosition implements Comparable<SourcePosition>{
 private int line;
 private int column;
 public SourcePosition(){
  this.line = 1;
  this.column = 1;
 }
 public SourcePosition(int line, int col){
  this.line = line;
  this.column = col;
 }
 public int getLine(){
  return this.line;
 }
 public int getColumn(){
  return this.column;
 }
 public void setLine(int line){
  this.line = line;
 }
 public void setColumn(int col){
  this.column = col;
 }
 public int compareTo(SourcePosition o) {
  if(o.getLine() > this.getLine() || 
   (o.getLine() == this.getLine() 
    && o.getColumn() > this.getColumn()) ){
   return -1;
  }else if(o.getLine() == this.getLine() && 
   o.getColumn() == this.getColumn()){
   return 0;
  }else{
   return 1;
  }
 }
}

}

Winston Chen 2009-07-12 01:18:59
