One of our providers sometimes sends XML feeds that are tagged as UTF-8 encoded documents but include byte sequences that are not valid UTF-8. When such a sequence is encountered, the parser throws an exception and stops building the DOM object:

DocumentBuilder.parse(ByteArrayInputStream bais)

throws the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

Is there a way to "capture" these problems early and avoid the exception (i.e. find and remove the offending characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?

+1  A: 

You should manually take a look at the invalid documents and see what they have in common. It's quite probable they are in fact in another encoding (most probably windows-1252), and the best solution then would be to take every document from the broken system and recode it to UTF-8 before parsing.
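
For illustration, a minimal recoding sketch along those lines might look like the following; the helper name recodeToUtf8 is hypothetical, and windows-1252 is only an assumption about the real encoding:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.Charset;

    // Sketch of the recode step: decode the raw bytes with the suspected
    // real charset (assumed here to be windows-1252), then re-encode as
    // UTF-8 so the bytes match the encoding declared in the XML header.
    static ByteArrayInputStream recodeToUtf8(byte[] rawBytes) {
        String text = new String(rawBytes, Charset.forName("windows-1252"));
        return new ByteArrayInputStream(text.getBytes(Charset.forName("UTF-8")));
    }

The parser would then be fed recodeToUtf8(rawBytes) instead of the original stream.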

Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.

You would also need a way to know when the broken system gets fixed so you can stop using your workaround.

CesarB
I suspect it is a case of mixed encodings (or just a few "rogue" characters), because other data from the same source works fine. The feed contains Swedish location names, so some characters are probably poorly encoded.
Burre
+2  A: 

If the problem truly is a wrong encoding (as opposed to a mixed encoding), you don't need to re-encode the document to parse it. Just parse from a Reader instead of an InputStream, and the DOM parser will ignore the encoding declared in the XML header:

DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
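
Spelled out with the required imports, that might look something like this (windows-1252 is only a guess at the real encoding, and exception handling is omitted):

    import java.io.InputStreamReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    // Wrapping the stream in a Reader fixes the character decoding up front,
    // so the parser never consults the document's own encoding declaration.
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(
            new InputStreamReader(inputStream, "windows-1252")));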
james
Thanks for the tip. It avoided the exception; unfortunately it didn't help much, because the illicit characters seem to be inside identifier strings I extract (and need), and those now come out with the wrong encoding. I think I'll just have to wait for the content providers to fix their error.
Burre
A: 

You should tell them to send you correct UTF-8. Failing that, any solution should re-encode the bad characters as valid UTF-8 before passing the document to the parser. The reason is that if the bad characters are preserved, different programs may interpret the output in different ways, which can lead to security holes.
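
One way to get that best-effort behaviour in Java (a sketch, not a definitive fix) is to decode with a CharsetDecoder configured to replace malformed input, so invalid byte sequences become U+FFFD replacement characters instead of exceptions:

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // Decode as UTF-8, but substitute U+FFFD for malformed byte sequences
    // instead of throwing. The bad bytes are replaced with a placeholder
    // rather than preserved, which avoids the ambiguity described above.
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    Reader reader = new InputStreamReader(inputStream, decoder);

The resulting Reader can then be wrapped in an org.xml.sax.InputSource and passed to DocumentBuilder.parse.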

Watson Ladd