We're parsing an XML document using JAXB and get this error:

[org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.]
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)

What exactly does this mean, and how can we resolve it?

We are executing the code as:

jaxbContext = JAXBContext.newInstance(Results.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
unmarshaller.setSchema(getSchema());
results = (Results) unmarshaller.unmarshal(new FileInputStream(inputFile));

Update

The issue appears to be due to this "funny" character in the XML file: ¿

Why would this cause such a problem?

Update 2

There are two of those weird characters in the file. They are around the middle of the file. Note that the file is created based on data in a database and those weird characters somehow got into the database.

Update 3

Here is the full XML snippet:

<Description><![CDATA[Mt. Belvieu ¿ Texas]]></Description>

Update 4

Note that there is no <?xml ...?> header.

The hex value of the special character is 0xBF.

+1  A: 

That's probably a Byte Order Mark (BOM), a special byte sequence at the start of a UTF-encoded file. They are, frankly, a pain in the arse, and seem particularly common when interacting with .NET systems.

Try rephrasing your code to use a Reader rather than an InputStream:

results = (Results) unmarshaller.unmarshal(new FileReader(inputFile));

A Reader is UTF-aware, and might make a better stab at it. More simply, pass the File directly to the Unmarshaller, and let the JAXBContext worry about it:

results = (Results) unmarshaller.unmarshal(inputFile);
skaffman
I can try that. Note that there are two of those characters in the file - see the second update to the post.
Marcus
Using the `FileReader` looks good. Got the same error when I just specified the `File`. Going to validate all my results but this looks good!
Marcus
But so I understand, these just seem like "weird" characters, not a "Byte Order Mark", no? Why do they cause this trouble?
Marcus
@Marcus: Well, the BOM *is* a sequence of weird characters, depending on how you look at them.
skaffman
A: 

It sounds as if your XML is encoded as UTF-16 but that encoding is not getting passed to the Unmarshaller. With the Marshaller you can set it using marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-16"); but because the Unmarshaller is not required to support any properties, I am not sure how to enforce it other than ensuring your XML document has encoding="UTF-16" in the initial <?xml ...?> declaration.

Andy
@Andy: It can't be UTF-16. An attempt to parse a UTF-16-encoded XML file as UTF-8 would fail on the markup itself. It's probably some single-byte encoding.
axtavt
You are correct. I was looking at the different encodings and got mixed up.
Andy
+2  A: 

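So, your problem is that JAXB treats XML files without an <?xml ...?> header as UTF-8, while your file uses some other encoding (probably ISO-8859-1 or Windows-1252, if the 0xBF byte is actually intended to mean ¿). In UTF-8, 0xBF is only valid as a continuation byte inside a multi-byte sequence, never on its own, which is why the parser rejects it.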
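To see why a lone 0xBF trips a UTF-8 parser, here is a minimal sketch using only the JDK's charset classes. The class name `ByteCheck` and the helper `isValidUtf8` are made up for illustration, and the sample bytes are an assumption about what the file contains around the odd character:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ByteCheck {
    // Hypothetical helper: true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Mt. " followed by the single byte 0xBF (assumed file content)
        byte[] raw = {0x4D, 0x74, 0x2E, 0x20, (byte) 0xBF};

        // 0xBF is 10111111 in binary, i.e. a UTF-8 continuation byte
        // with no lead byte in front of it.
        System.out.println("valid UTF-8? " + isValidUtf8(raw));

        // In ISO-8859-1 every byte maps to one character; 0xBF is U+00BF.
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Running the same check over the real file's bytes should tell you quickly whether it is UTF-8 at all.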

If you can change the producer of the file, you can either add an <?xml ...?> header that declares the actual encoding, or simply write the file as UTF-8.
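As a rough illustration of the first option, the sketch below parses the same ISO-8859-1 bytes with and without a declaration. It uses the JDK's built-in DOM parser rather than JAXB so it runs standalone; the class `DeclaredEncoding` and its methods are hypothetical, and the assumption is that your producer would emit a header like the one shown:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncoding {
    // Parses raw bytes and returns the text of the root element.
    // The parser picks the encoding from the XML declaration, if any.
    public static String parseDescription(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement().getTextContent();
    }

    // Builds the snippet from the question as ISO-8859-1 bytes,
    // optionally prefixed with a declaration naming that encoding.
    public static byte[] sample(boolean withDeclaration) {
        String decl = withDeclaration
            ? "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
            : "";
        String body =
            "<Description><![CDATA[Mt. Belvieu \u00BF Texas]]></Description>";
        return (decl + body).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // With the declaration, 0xBF is decoded as U+00BF.
        System.out.println(parseDescription(sample(true)));

        // Without it, the parser assumes UTF-8 and is expected to reject
        // the byte, much like the error in the question.
        try {
            System.out.println(parseDescription(sample(false)));
        } catch (Exception e) {
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```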

If you can't change the producer, you have to use an InputStreamReader with an explicit encoding, because (unfortunately) JAXB doesn't allow changing its default encoding:

results = (Results) unmarshaller.unmarshal(
   new InputStreamReader(new FileInputStream(inputFile), "ISO-8859-1")); 

However, this solution is fragile: it fails on input files whose <?xml ...?> header specifies a different encoding.

axtavt
Thanks, will try. Note that I get the same error when I use Xalan/Java to try and format the XML file using XSLT. Does Xalan also assume UTF-8?
Marcus
That works! Note that this code only runs on this file, which will never have the xml header. What is the advantage of this approach over using `results = (Results) unmarshaller.unmarshal(new FileReader(inputFile));`?
Marcus
@Marcus: `FileReader` uses the system default encoding, whereas `InputStreamReader` uses the one you specify explicitly.
axtavt
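For anyone who wants to see the `FileReader` vs. `InputStreamReader` difference for themselves, here is a small sketch. The class `ReaderEncoding` and the helper `readAll` are invented for this example, and the temp file just stands in for the real input:

```java
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReaderEncoding {
    // Hypothetical helper: drains a Reader into a String and closes it.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A stand-in file holding only the problematic byte.
        Path file = Files.createTempFile("bf-demo", ".txt");
        Files.write(file, new byte[] {(byte) 0xBF});

        // Explicit charset: the result is the same on every machine.
        String explicit = readAll(new InputStreamReader(
            new FileInputStream(file.toFile()), StandardCharsets.ISO_8859_1));

        // FileReader: the result depends on the JVM's default charset.
        String viaDefault = readAll(new FileReader(file.toFile()));

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit:    U+"
            + Integer.toHexString(explicit.charAt(0)).toUpperCase());
        System.out.println("via default: U+"
            + Integer.toHexString(viaDefault.charAt(0)).toUpperCase());

        Files.delete(file);
    }
}
```

On a machine whose default charset is a single-byte Western encoding the two lines will likely agree, which would explain why `FileReader` appeared to work; on one that defaults to UTF-8 the second is expected to come out as the replacement character U+FFFD.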