ansaurus

Question

Why does XMLEventReader report a CHARACTERS event that contains markup?

Answer 1

A:

Is this the same as the underlying SAX event, which carries a start offset and a length? If so, you will probably find that those two values specify a region of the buffer that excludes the markup.

dty 2010-09-09 19:55:44

I'm sorry, I don't understand. I'm talking about StAX, not SAX. StAX XMLEvents aren't, to my knowledge, related to SAX in any way. Also, the markup I'm getting out is partial and garbled, so this isn't a simple case of accidentally including the markup around the CHARACTERS content. Something in the stream is making the reader think that an element has started or ended. I suspect encoding is involved, but I can't find the broken link in the chain.

Laird Nelson 2010-09-09 20:56:25

I've looked at the StAX API, and it's not like the SAX one. In the SAX API, the `characters()` callback has a `char[]` as well as an `offset` and `length`, so sometimes you get extraneous characters in the `char[]`, but that doesn't seem to be the case with StAX. I think you will need to post more of the XML. Perhaps it's not well-formed?

dty 2010-09-09 21:17:09
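
To illustrate the SAX behaviour described in the comment above, here is a minimal sketch of a handler that honours `start` and `length` in `characters()`. The class name `SaxCharactersDemo` and the helper `collectText` are made up for this example, and the sample XML is invented, not the asker's document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCharactersDemo {
    // Hypothetical helper: concatenates all character data in the document.
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only ch[start] .. ch[start + length - 1] belong to this event.
                // The rest of the array is the parser's buffer and may hold markup.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(collectText("<doc><p>it's</p><p> fine</p></doc>"));
    }
}
```

Calling `new String(ch)` instead of using the offset and length is the usual way markup leaks into SAX text. StAX `Characters.getData()` returns a `String`, so that particular mistake does not apply there.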

Is the file REALLY UTF-8? Just because it claims to be doesn't mean it is. And just because Firefox says it is doesn't settle it either: Firefox could simply be trusting the declared encoding, since reliably detecting an encoding is practically impossible. If the file is actually in some 8-bit ISO encoding, your apostrophe will be a byte with the top bit set, which will very likely throw off the stream decoder when it tries to decode that byte and the ones following it as a UTF-8 sequence.

dty 2010-09-09 22:55:00
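
One way to test the encoding theory above is to run the raw bytes through a strict UTF-8 decoder and see whether it rejects them. This is a sketch under assumptions: the class `Utf8Check` and the method `isValidUtf8` are names invented here, and the sample assumes the stray byte is `0x92`, which is the curly apostrophe in Windows-1252. The real file may contain a different byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Curly apostrophe correctly encoded as UTF-8 (three bytes).
        byte[] good = "<p>it\u2019s</p>".getBytes(StandardCharsets.UTF_8);
        // Same text with the apostrophe as the single byte 0x92 (assumed Windows-1252).
        byte[] bad = {'<', 'p', '>', 'i', 't', (byte) 0x92, 's', '<', '/', 'p', '>'};

        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false: 0x92 cannot start a UTF-8 sequence
    }
}
```

If the check fails on the real file, the bytes do not match the declared encoding. In that case, either fix the XML declaration or decode with the correct charset before handing the stream to the `XMLEventReader`.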
