ansaurus

Question

Converting document encoding when reading with dom4j

Answer 1

This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).

Just make sure that the XML document declares its character encoding (the declaration is required unless the document is UTF-8 or UTF-16).
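
As a rough sketch of that automatic decoding, the snippet below uses the JDK's built-in JAXP parser instead of dom4j, so it runs without extra jars; dom4j's SAXReader sits on top of a SAX parser that applies the same rule. The class name, method name and sample text are made up for illustration. The document is stored as ISO-8859-2 bytes and says so in its declaration, and the caller gives the parser no hint at all.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: not part of dom4j or of the original question.
final class DeclaredEncodingDemo {

    /** Parses raw XML bytes and returns the root element's text; the parser picks the encoding. */
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // \u0142 is the Polish letter l with stroke, which ISO-8859-2 stores as a single byte.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><name>Micha\u0142</name>";
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));
        // Compare against the expected String instead of printing it,
        // so the console's own encoding cannot muddy the result.
        System.out.println(rootText(bytes).equals("Micha\u0142"));
    }
}
```

Once parsed, the text is an ordinary Java String, so nothing further needs converting.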

erickson 2009-06-11 16:45:33

Very true, dom4j does that automatically. My tests didn't pass because the files being parsed were set in Eclipse project properties to have UTF-8 encoding. 6 hours spent debugging this... I hate programming, sometimes :(

Michał Paluchowski 2009-06-12 09:03:20

Answer 2

The decoding happens in (or before) the InputSource, so the characters are already decoded by the time dom4j's SAXReader receives them from the parser. From the InputSource javadocs:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
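
The byte-stream case described there might look roughly like this. It is only a sketch: it uses the JDK's JAXP parser rather than dom4j, and the class name, method name and sample text are invented for the example. The document has no XML declaration, so the encoding is passed in through InputSource.setEncoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of dom4j or of the original question.
final class InputSourceEncodingDemo {

    /** Parses a byte stream and names its encoding on the InputSource. */
    static String rootText(byte[] xmlBytes, String encoding) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            source.setEncoding(encoding); // byte stream plus an explicit encoding
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source)
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No XML declaration, so the bytes carry no statement of their encoding.
        byte[] bytes = "<name>Micha\u0142</name>".getBytes(Charset.forName("ISO-8859-2"));
        System.out.println(rootText(bytes, "ISO-8859-2").equals("Micha\u0142"));
    }
}
```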

So it depends on how you are creating the InputSource. To guarantee the proper decoding you can use something like the following:

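// Needs java.io.*, java.nio.charset.Charset and org.xml.sax.InputSource.
InputStream stream = <input source>; // placeholder: the raw bytes of the XML document
Charset charset = Charset.forName("ISO-8859-2");
// With a character stream, the parser ignores the XML encoding declaration.
Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
InputSource source = new InputSource(reader);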
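
For a version that can be run end to end, here is a sketch along the same lines. It assumes the JDK's JAXP parser in place of dom4j, and the class name, method name and sample document are hypothetical. The sample deliberately declares the wrong encoding (UTF-8) while the bytes are ISO-8859-2, to show the Reader taking precedence over the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of dom4j or of the original question.
final class ExplicitCharsetDemo {

    /** Decodes the bytes with the given charset before the parser sees them. */
    static String rootText(byte[] xmlBytes, String charsetName) {
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(xmlBytes), Charset.forName(charsetName));
            InputSource source = new InputSource(reader); // character stream
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source)
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The declaration claims UTF-8, but the bytes are really ISO-8859-2.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Micha\u0142</name>";
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));
        System.out.println(rootText(bytes, "ISO-8859-2").equals("Micha\u0142"));
    }
}
```

With dom4j on the classpath, the same InputSource can be passed to SAXReader.read(InputSource) instead of the JAXP builder.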

Kathy Van Stone 2009-06-11 16:50:00

This is assuming that the default action of the InputSource was getting confused.

Kathy Van Stone 2009-06-11 18:00:23
