One of our providers sometimes sends XML feeds that are tagged as UTF-8 encoded documents but include byte sequences that are not valid UTF-8. When such a sequence is encountered, the parser throws an exception and stops building the DOM object:

DocumentBuilder.parse(ByteArrayInputStream bais)

throws the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

Is there a way to "capture" these problems early and avoid the exception (i.e. find and remove the offending characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?

+1  A: 

You should manually take a look at the invalid documents and see what they have in common. It's quite probable they are in fact in another encoding (most probably windows-1252), and the best solution then would be to take every document from the broken system and recode it to UTF-8 before parsing.
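
For illustration, a minimal recoding sketch along those lines might look like the following; the helper name recodeToUtf8 is hypothetical, and windows-1252 is only an assumption about the real encoding:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.Charset;

    // Sketch of the recode step: decode the raw bytes with the suspected
    // real charset (assumed here to be windows-1252), then re-encode as
    // UTF-8 so the bytes match the encoding declared in the XML header.
    static ByteArrayInputStream recodeToUtf8(byte[] rawBytes) {
        String text = new String(rawBytes, Charset.forName("windows-1252"));
        return new ByteArrayInputStream(text.getBytes(Charset.forName("UTF-8")));
    }

The parser would then be fed recodeToUtf8(rawBytes) instead of the original stream.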

Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.

You would also need a way to know when the broken system gets fixed so you can stop using your workaround.

CesarB
I suspect it is a case of mixed encodings (or just a few "rogue" characters), because other data from the same source works fine. The feed contains Swedish location names, so some characters are probably poorly encoded.
Burre
+2  A: 

If the problem truly is a wrong encoding (as opposed to a mixed encoding), you don't need to re-encode the document to parse it. Just parse from a Reader instead of an InputStream, and the DOM parser will ignore the encoding declared in the XML header:

DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
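
Spelled out with the required imports, that might look something like this (windows-1252 is only a guess at the real encoding, and exception handling is omitted):

    import java.io.InputStreamReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    // Wrapping the stream in a Reader fixes the character decoding up front,
    // so the parser never consults the document's own encoding declaration.
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(
            new InputStreamReader(inputStream, "windows-1252")));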
james
Thanks for the tip. It avoided the exception; unfortunately it didn't help much, because the illicit characters seem to be inside identifier strings I extract (and need), and those now come out with the wrong encoding. I think I'll just have to wait for the content providers to fix their error.
Burre
A: 

You should tell them to send you correct UTF-8. Failing that, any solution should re-encode the bad characters as valid UTF-8 before passing the document to the parser. The reason is that if the bad characters are preserved, different programs may interpret the output in different ways, which can lead to security holes.
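
One way to get that best-effort behaviour in Java (a sketch, not a definitive fix) is to decode with a CharsetDecoder configured to replace malformed input, so invalid byte sequences become U+FFFD replacement characters instead of exceptions:

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // Decode as UTF-8, but substitute U+FFFD for malformed byte sequences
    // instead of throwing. The bad bytes are replaced with a placeholder
    // rather than preserved, which avoids the ambiguity described above.
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    Reader reader = new InputStreamReader(inputStream, decoder);

The resulting Reader can then be wrapped in an org.xml.sax.InputSource and passed to DocumentBuilder.parse.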

Watson Ladd