I have an XMLEventReader. It has been built from an XMLInputFactory with the "UTF8" encoding. I am using it to read an XML file whose "encoding" attribute is set to "UTF-8".

I have verified that the XML file views correctly under Firefox. When you view the page encoding, it says that it is UTF-8.

I have set the XMLEventReader to coalesce character events like this:

reader.setProperty(XMLEventReader.IS_COALESCING, Boolean.TRUE);
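For comparison, in the standard `javax.xml.stream` API the coalescing switch is the `XMLInputFactory.IS_COALESCING` property, and it is set on the factory before the reader is created (`XMLEventReader` itself only declares `getProperty`). A minimal sketch of that setup, assuming the input is a UTF-8 byte stream; the class name `CoalescingSetup` and the helper `open` are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

public class CoalescingSetup {

    // Hypothetical helper: returns an event reader created from a
    // factory that has coalescing switched on.
    static XMLEventReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // The parser receives raw bytes plus an explicit encoding name,
        // so it does the byte-to-character decoding itself.
        return factory.createXMLEventReader(in, "UTF-8");
    }

    public static void main(String[] args) throws XMLStreamException {
        // In-memory stand-in for the real file (assumption for the demo).
        byte[] xml = "<text>It\u2019s hard</text>".getBytes(StandardCharsets.UTF_8);
        XMLEventReader reader = open(new ByteArrayInputStream(xml));
        while (reader.hasNext()) {
            System.out.println(reader.nextEvent());
        }
        reader.close();
    }
}
```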

The XML document does not have a DTD. It is valid.

The XMLEventReader will occasionally report a CHARACTERS event whose content is, for example:

r problems were most severe and frequent.) Did you sleep a lot more than usual nearly every night during that period?</text>  Ð 

Note the presence of the markup tag near the end of the sample, as well as the stray Ð (capital eth). Note also that the sentence has been lopped off; presumably there was another CHARACTERS event before this one that contained the leading part of the sentence.

Why does the XMLEventReader screw up the parsing? Why are the characters not displaying correctly? Why does the XMLEventReader not coalesce CHARACTERS events, if that's what's going on? Why is StAX so unbelievably festeringly ugly and unpredictable?

I am using the XMLEventReader supplied to me by my Java runtime (Java 6) on a Mac.

Here is some sample XML, which of course I've simply copied from my editor, so who knows what character conversions occurred as a result of that, but anyhow:

<question id="BMHPD17">
  <permittedResponseCount>1</permittedResponseCount>
  <text>It’s hard for me to stay out of trouble. (Would you say this is true or false for you?)</text>
  <namedAnswerSet idref="TrueFalse"></namedAnswerSet>
</question>

Note the "smart apostrophe" (’, U+2019) in the <text> element on line 3 of the sample.

I read this by reacting to a CHARACTERS event and saving its contents to a local String variable. When I later receive the END_ELEMENT event whose name is "question", I take that saved String and pass it to the constructor of a Java object.
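The reading loop described here can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the class and method names are hypothetical, and it appends each CHARACTERS event inside `<text>` to a buffer rather than keeping only the most recent event, so that a split text run or the whitespace-only event before `</question>` cannot replace the text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class QuestionTextReader {

    // Hypothetical helper: returns the content of <text> once the
    // enclosing </question> has been seen, or null if there is none.
    static String readQuestionText(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(in, "UTF-8");
        StringBuilder text = new StringBuilder();
        boolean inText = false;
        String result = null;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    if ("text".equals(event.asStartElement().getName().getLocalPart())) {
                        inText = true;
                        text.setLength(0); // start a fresh buffer for this element
                    }
                } else if (event.isCharacters()) {
                    if (inText) {
                        // Append, so several CHARACTERS events add up.
                        text.append(event.asCharacters().getData());
                    }
                } else if (event.isEndElement()) {
                    String name = event.asEndElement().getName().getLocalPart();
                    if ("text".equals(name)) {
                        inText = false;
                    } else if ("question".equals(name)) {
                        result = text.toString();
                    }
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<question id=\"BMHPD17\">\n"
                + "  <text>It\u2019s hard for me to stay out of trouble.</text>\n"
                + "</question>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(readQuestionText(new ByteArrayInputStream(bytes)));
    }
}
```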

When I System.out.println() the result, I get (sometimes) the bogus junk I referred to earlier.

When I wrap System.out inside a PrintWriter with "UTF8" encoding set, so that I'm not simply outputting characters according to the platform's encoding, I get the same results.
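A minimal sketch of that kind of wrapping, using a `PrintStream` with an explicit charset name. The class and method names are hypothetical, and whether the console then shows the right glyph still depends on the terminal's own encoding, which this code cannot control:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Output {

    // Wraps any OutputStream so that text is written as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    // Helper for inspection: returns the bytes produced for a string.
    static byte[] encode(String s) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = utf8(buffer);
        ps.print(s);
        ps.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // \u2019 is the "smart apostrophe" from the sample XML.
        utf8(System.out).println("It\u2019s hard for me to stay out of trouble.");
    }
}
```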

A: 

Is this the same as the underlying SAX event, where the callback receives a buffer together with a start offset and a length? If so, you will probably find that these specify a region of the buffer that excludes the markup.

dty
I'm sorry, I don't understand. I'm talking about StAX, not SAX. StAX XMLEvents aren't, to my knowledge, related to SAX in any way. Also, I'm getting partial--and garbled--markup out, so it isn't a simple case of accidentally including all the markup around the CHARACTERS content. Something in the stream is triggering the reader to think that a new element has started, or ended, or something; I suspect there is something related to encoding going on here, but I can't find the link in the chain.
Laird Nelson
I've looked at the StAX API, and it's not like the SAX one. In the SAX API, the `characters()` callback has a `char[]` as well as an `offset` and `length`, so sometimes you get extraneous characters in the `char[]`, but that doesn't seem to be the case with StAX. I think you will need to post more of the XML. Perhaps it's not well-formed?
dty
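To illustrate the SAX behaviour mentioned in the comment above: the `char[]` handed to `characters()` is the parser's buffer, and only the window from `start` to `start + length` belongs to the event. A small sketch, with a hypothetical class name, of a handler that respects that window:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy only the valid window; the rest of ch may hold
        // unrelated data such as neighbouring markup.
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns all character data.
    static String collect(String xml) throws Exception {
        SaxTextCollector handler = new SaxTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<text>Did you sleep a lot more than usual?</text>"));
    }
}
```

Using `new String(ch)` instead of the offset and length is the mistake that produces stray markup in SAX handlers; the StAX event API has no equivalent buffer window.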
Is the file REALLY UTF-8? I mean, just because it claims to be doesn't mean it is. And just because Firefox says it is, it could just be reporting the declared encoding (since detecting encodings reliably is practically impossible). If the file is actually in some 8-bit ISO encoding, your apostrophe will be a single byte with the top bit set, which will very likely throw off the stream decoder when it tries to decode that byte and some of the following ones as a UTF-8 sequence.
dty
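One way to test that suspicion is to decode the raw bytes strictly and see whether the decoder rejects them. A minimal sketch, assuming the whole file fits in memory; the class name `Utf8Check` is hypothetical, and the windows-1252 byte in the demo is only an example of what a mislabelled file might contain:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            // Check a real file whose path is given on the command line.
            byte[] file = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(isValidUtf8(file) ? "valid UTF-8" : "NOT valid UTF-8");
            return;
        }
        // Demo: the same word with a real UTF-8 apostrophe, and with the
        // single byte 0x92 that windows-1252 uses for that character.
        byte[] good = "It\u2019s".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'I', 't', (byte) 0x92, 's'};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

A file that passes this check can still be mislabelled in principle, but a file that fails it is definitely not UTF-8, whatever its XML declaration says.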