ansaurus

Question

How to let the SAX parser determine the encoding from the XML declaration?

Answer 1

A:

I found the answer myself.

The SAX parser uses InputSource internally, and the InputSource docs say:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.

So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

Allan 2010-08-14 09:43:00

Constructing an InputStreamReader without specifying a charset uses the default charset of your machine, which is probably ISO-8859-1. As you quoted, the encoding declaration in the XML is ignored when a character stream is used, so this code will only work with documents encoded in that default charset. Your original code should actually have worked; maybe you could add the exception or the exact problem you are seeing to your question. When using a byte stream without setting the encoding on the InputSource, the XML parser should autodetect the encoding as described in http://www.w3.org/TR/REC-xml/#sec-guessing.

Jörn Horstmann 2010-08-14 10:38:41
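
The byte-stream approach described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: EncodingDemo and TextHandler are hypothetical names (TextHandler stands in for the FeedHandler from the answer, whose code is not shown), and an in-memory byte array replaces getInputStream(). The sketch parses the same bytes once through a byte stream, where the parser is free to read the encoding declaration, and once through a character stream, where the caller's charset is used instead.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingDemo {

    /** Hypothetical handler that collects all character data in the document. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Byte stream, no encoding set on the InputSource: the parser autodetects it. */
    static String parseBytes(byte[] xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextHandler handler = new TextHandler();
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        parser.parse(source, handler);
        return handler.text.toString();
    }

    /** Character stream: bytes are decoded with the given charset before the parser sees them. */
    static String parseChars(byte[] xml, Charset charset) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextHandler handler = new TextHandler();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        InputSource source = new InputSource();
        source.setCharacterStream(reader);
        parser.parse(source, handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A document that declares ISO-8859-1 and is actually encoded that way.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>\u00e9</t>"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("byte stream:        " + parseBytes(xml));
        System.out.println("reader, ISO-8859-1: " + parseChars(xml, StandardCharsets.ISO_8859_1));
        System.out.println("reader, UTF-8:      " + parseChars(xml, StandardCharsets.UTF_8));
    }
}
```

If a Reader is unavoidable, passing the charset explicitly, as parseChars does, at least makes the choice visible instead of depending on the machine default.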

Basically I get an invalid token exception if I don't use "is.setCharacterStream()".

Allan 2010-08-15 20:57:19
