SAX keeps on dying on the following exception:

Invalid byte 2 of 3-byte UTF-8 sequence

The problem is that the file is mostly valid UTF-8, but it contains a few invalid byte sequences. We cannot get a new version of the file; we have to use this one.

So how do we tell SAX to ignore invalid character sequences, or clean up the UTF-8 file so that it doesn't have invalid UTF-8 sequences?

+2  A: 

You could filter the stream before SAX reads it. Create an InputStream which reads your stream and drops invalid characters.
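As a rough illustration of that idea, here is a minimal sketch. The class name `Utf8Cleaner` and the method `cleaned` are made up for this example, and it assumes the whole input fits in memory. Rather than hand-validating bytes, it leans on the JDK's own UTF-8 decoder configured to ignore malformed input, then re-encodes the result so the parser only ever sees well-formed UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SAX or the JDK.
class Utf8Cleaner {
    /** Returns a stream over the same data with invalid UTF-8 sequences dropped. */
    static InputStream cleaned(InputStream in) throws IOException {
        // Buffer the whole input (assumption: it fits in memory).
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            raw.write(buf, 0, n);
        }

        // Decode leniently: malformed or unmappable input is silently skipped.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(raw.toByteArray()));

        // Re-encode so the parser receives well-formed UTF-8 bytes.
        byte[] clean = chars.toString().getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(clean);
    }
}
```

The returned stream can be handed to the SAX parser in place of the original one. Note that dropping bytes silently loses data, so this is only reasonable if the damaged characters do not matter.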

Peter Lawrey
I guess it has to be said for some people, but this is kind of stating the obvious. (:
corydoras
+2  A: 

I would suggest that you clean up the file as a completely separate step from parsing it as XML.

UTF-8 is a fairly easy encoding to understand; this web page shows how UTF-8 is meant to be formed. I suggest you write a program which reads in your input file and writes out a new file. It will read byte by byte, only writing out a character when it sees that it has been validly formed. When it sees an invalid byte, it would write out the string "UTF8ERROR" or some other easily-findable token which wouldn't occur naturally in the input data. It would then skip the rest of the character.
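A minimal sketch of such a clean-up pass might look like the following. The class and method names are invented for illustration, and it works on a byte array for brevity rather than streaming a file. It only checks the structure of each sequence (a valid lead byte followed by the right number of continuation bytes); it does not reject overlong three- and four-byte encodings or encoded surrogates, so treat it as a starting point rather than a full validator. After a broken sequence it resumes at the first byte that did not fit, which may itself be the start of a valid character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sanitizer; names are illustrative only.
class Utf8Marker {
    static final byte[] MARKER = "UTF8ERROR".getBytes(StandardCharsets.US_ASCII);

    /** Length of the sequence started by lead byte b, or 0 if b cannot start one. */
    static int sequenceLength(int b) {
        if (b < 0x80) return 1;                  // ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;    // 0xC0 and 0xC1 would be overlong
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF4) return 4;    // above 0xF4 exceeds U+10FFFF
        return 0;                                // stray continuation or invalid lead
    }

    /** Copies well-formed sequences and replaces broken ones with MARKER. */
    static byte[] markInvalid(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int len = sequenceLength(in[i] & 0xFF);
            // Count how many bytes of this sequence are actually present and well formed.
            int valid = (len == 0) ? 0 : 1;
            while (valid > 0 && valid < len && i + valid < in.length
                    && (in[i + valid] & 0xC0) == 0x80) {
                valid++;
            }
            if (len != 0 && valid == len) {
                out.write(in, i, len);           // complete character: copy it through
                i += len;
            } else {
                out.write(MARKER, 0, MARKER.length);
                i += Math.max(valid, 1);         // skip the broken part of the character
            }
        }
        return out.toByteArray();
    }
}
```

Searching the output for `UTF8ERROR` then shows where the damage is and how much of it there is.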

Afterwards, you can check where the errors have occurred and fix up the data... then parse it as normal.

This way you'll see how widespread the errors are, see if there's any pattern to them, and potentially be able to correct them. If you're going to receive more data from the same source, I'd strongly encourage you to tell them about the issue... it may indicate a more serious problem on their side.

Jon Skeet
So basically you're confirming what I hoped not to have to do. We have to write our own UTF-8 sanitizer.
corydoras
I'm afraid so. There may be similar things available on the net already, but I don't know of any.
Jon Skeet
Looking for other people with your exact problem suggests that you might have your encoding specified wrong. If this is the case, it could be a quick fix. Check here: http://www.openrdf.org/doc/sesame/users/ch09.html#d0e3707
Gunslinger47
Sorry Gunslinger, not a quick fix. The problem is that there are non-UTF-8 characters mixed into a UTF-8 encoded file.
corydoras
+2  A: 

SAX (and other XML tools) are designed to work on well-formed (or, when required, valid) XML. They deliberately throw errors or exceptions when the input is not well-formed, and that includes input that does not conform to its declared encoding. So, as other answers have suggested, you have to use a separate step to clean up the input.

(Similarly, SAX will throw errors on HTML that is not well-formed XML, for example when end tags are missing.)

peter.murray.rust
A: 

I guess this won't help you much, but maybe others would like to know:

I recently got the same exception when retrieving a UTF-8 XML file that was served with ISO-8859-1 headers. The solution was to specify UTF-8 manually via String.getBytes(charset):

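// Needs java.io.*, javax.xml.parsers.*, org.w3c.dom.Document, org.xml.sax.SAXException
public Document parseRequest(HttpServletRequest request)
      throws IOException, ParserConfigurationException, SAXException {
   DocumentBuilderFactory builder = DocumentBuilderFactory.newInstance();

   // Caution: readUTF() expects the length-prefixed "modified UTF-8" format
   // written by DataOutputStream.writeUTF(), not an arbitrary request body.
   DataInputStream dataStream = new DataInputStream(request.getInputStream());
   String xml = dataStream.readUTF();
   // Re-encode the string as UTF-8 so the parser sees the bytes it expects.
   ByteArrayInputStream byteStream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
   return builder.newDocumentBuilder().parse(byteStream);
}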

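EDIT: ... or even simpler:

// Needs java.io.*, javax.xml.parsers.*, org.w3c.dom.Document, org.xml.sax.*
public Document parseRequest(HttpServletRequest request)
      throws IOException, ParserConfigurationException, SAXException {
   DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();

   // Decode the bytes as UTF-8 ourselves instead of trusting the response headers.
   Reader reader = new InputStreamReader(request.getInputStream(), "UTF-8");
   InputSource source = new InputSource(reader);
   return domFactory.newDocumentBuilder().parse(source);
}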
neu242
Indeed useful information for some people, but you are correct, this doesn't fix the problem of mixed encodings stored in a single file.
corydoras
A: 

Could you use java.nio.charset.CharsetDecoder together with InputStreamReader(InputStream in, CharsetDecoder dec) somehow?

How a decoding error is handled depends upon the action requested for that type of error, which is described by an instance of the CodingErrorAction class. The possible error actions are to ignore the erroneous input, report the error to the invoker via the returned CoderResult object, or replace the erroneous input with the current value of the replacement string. The replacement has the initial value "\uFFFD"; its value may be changed via the replaceWith method.

(from the CharsetDecoder javadoc)
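One way this might be wired together is sketched below. The class name `LenientSaxParse` and the method `parseText` are invented for the example, and it has not been tried against the file in question. The decoder is told to replace malformed input with U+FFFD instead of reporting it, the resulting Reader is wrapped in an InputSource, and a trivial handler collects the character data so the effect is visible.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper showing a lenient decoder in front of SAX.
class LenientSaxParse {
    /** Parses the stream with SAX and returns the concatenated character data. */
    static String parseText(InputStream in) throws Exception {
        // Replace bad byte sequences with U+FFFD rather than failing on them.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(in, decoder);

        // Minimal handler that just accumulates text content.
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        // Because a Reader is supplied, the parser receives characters, not raw bytes.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        return text.toString();
    }
}
```

Since the decoding happens before the parser sees the data, the parser never encounters the invalid bytes. If a damaged sequence happens to fall inside markup rather than text, the document may still fail to parse as XML.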

neu242
Interesting idea, I am not sure.
corydoras