ansaurus

Question

Tell SAX Parser to ignore invalid characters?

Answer 1

+2 A:

You could filter the stream before SAX reads it. Create an InputStream which reads your stream and drops invalid characters.
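A minimal sketch of such a filter, under a few assumptions: it works on characters rather than bytes, so it is a Reader wrapper instead of an InputStream, and "invalid" is taken to mean characters outside the XML 1.0 Char production (mostly control characters). The class name XmlCharFilterReader is made up for illustration. It does not repair malformed byte sequences; that has to happen when the bytes are decoded.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: drops characters that XML 1.0 does not allow.
class XmlCharFilterReader extends FilterReader {
    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
    // supplementary characters. Surrogates are kept because pairs of them
    // encode supplementary characters; lone surrogates are not checked here.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (isAllowed(c)) {
                    buf[off + kept++] = c;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Every character in this chunk was dropped: read the next chunk.
        }
    }
}
```

The wrapped reader can then be handed to the parser through new InputSource(reader).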

Peter Lawrey 2009-10-19 06:02:41

I guess it has to be said for some people, but this is kind of stating the obvious. (:

corydoras 2009-10-19 06:34:51

Answer 2

+2 A:

I would suggest that you clean up the file as a completely separate step from parsing it as XML.

UTF-8 is a fairly easy encoding to understand; this web page shows how UTF-8 is meant to be formed. I suggest you write a program which reads in your input file and writes out a new file. It will read byte by byte, only writing out a character when it sees that it has been validly formed. When it sees an invalid byte, it would write out the string "UTF8ERROR" or some other easily-findable token which wouldn't occur naturally in the input data. It would then skip the rest of the character.

Afterwards, you can check where the errors have occurred and fix up the data... then parse it as normal.

This way you'll see how widespread the errors are, see if there's any pattern to them, and potentially be able to correct them. If you're going to receive more data from the same source, I'd strongly encourage you to tell them about the issue... it may indicate a more serious problem on their side.
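A rough sketch of the sanitizer described above, in Java. Rather than hand-rolling the byte-by-byte UTF-8 checks, it uses java.nio.charset.CharsetDecoder with CodingErrorAction.REPORT to locate malformed sequences, which is a substitute technique and not necessarily what the answer had in mind. The class name Utf8Sanitizer and the method sanitize are invented for illustration. The marker is appended by hand because the decoder's replaceWith method limits the replacement length to maxCharsPerByte(), which is too short for a token like "UTF8ERROR".

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8 and marks every malformed sequence.
class Utf8Sanitizer {
    static final String MARKER = "UTF8ERROR";

    static String sanitize(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(input);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(input.length + 1);
        StringBuilder result = new StringBuilder();

        while (true) {
            // The whole input is in the buffer, so endOfInput is always true.
            CoderResult cr = decoder.decode(in, out, true);
            out.flip();
            result.append(out);   // keep everything decoded so far
            out.clear();
            if (cr.isError()) {
                // The position is at the start of the bad bytes: mark and skip them.
                result.append(MARKER);
                in.position(in.position() + cr.length());
            } else if (cr.isUnderflow()) {
                break;            // all input consumed
            }
            // On overflow, loop again with the emptied output buffer.
        }
        return result.toString();
    }
}
```

This version holds the whole input in memory, so a large file would need a chunked loop instead. Searching the output for the marker then shows where the bad bytes were.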

Jon Skeet 2009-10-19 06:06:43

So basically you're confirming what I hoped not to have to do: we have to write our own UTF-8 sanitizer.

corydoras 2009-10-19 06:36:36

I'm afraid so. There may be similar things available on the net already, but I don't know of any.

Jon Skeet 2009-10-19 06:41:30

Looking for other people with your exact problem suggests that you might have your encoding specified wrong. If this is the case, it could be a quick fix. Check here: http://www.openrdf.org/doc/sesame/users/ch09.html#d0e3707

Gunslinger47 2009-10-19 06:51:11

Sorry Gunslinger, not a quick fix. The problem is that there are non-UTF-8 bytes mixed into a UTF-8-encoded file.

corydoras 2009-10-19 22:34:39

Answer 3

+2 A:

SAX (and other XML tools) are designed to work on well-formed (or, when required, valid) XML. They deliberately throw errors or exceptions when the input is not well-formed, and that includes input that does not conform to its declared encoding. So, as other answers have suggested, you have to clean up the input in a separate step.

(Similarly, SAX will throw errors on HTML that is not well-formed XML, for example when end tags are missing.)

peter.murray.rust 2009-10-19 06:10:38

Answer 4

A:

I guess this won't help you much, but maybe others would like to know:

I recently got the same exception when retrieving a UTF-8 XML file that was served with ISO-8859-1 headers. The solution was to specify UTF-8 manually via String.getBytes(charset):

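public Document parseRequest(HttpServletRequest request)
      throws IOException, ParserConfigurationException, SAXException {
   DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

   // Note: readUTF() expects the length-prefixed "modified UTF-8" format
   // written by DataOutputStream.writeUTF(), not an arbitrary request body.
   DataInputStream dataStream = new DataInputStream(request.getInputStream());
   String xml = dataStream.readUTF();
   // Hand the parser the document as UTF-8 bytes.
   ByteArrayInputStream byteStream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
   return factory.newDocumentBuilder().parse(byteStream);
}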

EDIT: .. or even simpler:

public Document parseRequest(HttpServletRequest request)
      throws IOException, ParserConfigurationException, SAXException {
   DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();

   // Decode the request body as UTF-8 explicitly, whatever charset was declared.
   Reader reader = new InputStreamReader(request.getInputStream(), "UTF-8");
   InputSource source = new InputSource(reader);
   return domFactory.newDocumentBuilder().parse(source);
}

neu242 2009-11-04 14:36:48

Useful information for some people indeed, but you are correct: this doesn't fix the problem of mixed encodings stored in a single file.

corydoras 2009-11-05 22:59:24

Answer 5

A:

Could you use java.nio.charset.CharsetDecoder together with InputStreamReader(InputStream in, CharsetDecoder dec) somehow?

How a decoding error is handled depends upon the action requested for that type of error, which is described by an instance of the CodingErrorAction class. The possible error actions are to ignore the erroneous input, report the error to the invoker via the returned CoderResult object, or replace the erroneous input with the current value of the replacement string. The replacement has the initial value "\uFFFD"; its value may be changed via the replaceWith method.

(from the CharsetDecoder javadoc)
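One way this might look, as a sketch rather than a tested recipe for the original poster's data: a UTF-8 decoder configured with CodingErrorAction.IGNORE is wrapped in an InputStreamReader, and the reader is passed to the SAX parser through an InputSource. The class name LenientSaxExample and the methods lenientUtf8Reader and parseText are invented for illustration; parseText just collects the character data so the effect is visible.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of any library.
class LenientSaxExample {

    // Reader that silently drops byte sequences that are not valid UTF-8.
    static Reader lenientUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        return new InputStreamReader(in, decoder);
    }

    // Parses the document with SAX and returns all character data it contains.
    static String parseText(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        Reader reader = lenientUtf8Reader(new ByteArrayInputStream(xml));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        return text.toString();
    }
}
```

Switching IGNORE to REPLACE would put "\uFFFD" where the bad bytes were instead of dropping them. Either way this only deals with encoding errors; characters that decode correctly but are not allowed in XML would still make the parser fail.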

neu242 2009-11-06 08:31:37

Interesting idea, I am not sure.

corydoras 2009-11-08 21:32:42
