First things first: I cannot change the output of the XML, as it is being produced by a third party. They are inserting invalid characters in the XML. I am given an InputStream of the byte stream representation of the XML. Is there a cleaner way to filter out the offending characters besides consuming the stream into a String and processing it? I found a suggestion to use a FilterReader, but that doesn't work for me as I have a byte stream and not a character stream.

For what it's worth, this is all part of a JAXB unmarshalling procedure, just in case that offers options.

We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.

Here is a FilterReader I tried to build.

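import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class InvalidXMLCharacterFilterReader extends FilterReader
{

    private static final Log LOG = LogFactory
            .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in)
    {
        super(in);
    }

    // Single-character read, delegating to the filtering bulk read below.
    @Override
    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
            return -1;
        else
            return (int) buf[0];
    }

    // Reads a chunk and compacts it in place, dropping invalid characters.
    @Override
    public int read(char[] buf, int from, int len) throws IOException {
        if (len == 0)
            return 0; // avoid looping forever on a zero-length request

        int count = 0;
        // Keep reading until at least one valid character survives or EOF.
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + buf[i]);
                if (!isBadXMLChar(buf[i]))
                {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

    // XML 1.0 Char production. A Java char never exceeds 0xFFFF, so
    // supplementary characters (0x10000-0x10FFFF) arrive as surrogate
    // pairs; surrogates are let through here without checking pairing.
    private boolean isBadXMLChar(char c)
    {
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            Character.isSurrogate(c))
        {
            return false;
        }
        return true;
    }

}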

And here is how I am unmarshalling it:

jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

And some example bad XML:

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar&#x01;
</foo>
A: 

In order to do this with a filter, the filter needs to be XML-entity aware, because (at least in your example, and likely sometimes in actual use) the bad characters appear in the XML as entities, that is, as numeric character references such as &#x01;.

The filter is seeing your entity as a sequence of 6 perfectly acceptable characters and thus not stripping them.

The conversion that breaks JAXB happens later in the process, when the XML parser expands the reference into the actual character.
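To make the point concrete, here is a minimal sketch of what "entity aware" could look like, assuming the whole document is already available as a String. The class and method names (InvalidXmlEntityStripper, stripInvalidReferences, isValidXmlCodePoint) are made up for illustration and are not part of JAXB or any library. It finds numeric character references, decodes the code point, and drops the reference only if that code point is not a legal XML 1.0 character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidXmlEntityStripper {

    // Matches numeric character references such as &#1; or &#x01;
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes numeric references whose target is not a legal XML 1.0 character. */
    static String stripInvalidReferences(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                valid = isValidXmlCodePoint(cp);
            } catch (NumberFormatException e) {
                valid = false; // too large to be any code point
            }
            // Keep valid references verbatim, replace invalid ones with nothing.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

One caveat with this sketch: it does not understand CDATA sections or comments, where the text &#x01; is literal and harmless, so it would strip it there as well. For the example document in the question that makes no difference.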

Don Roby
Right. So got any ideas about an entity-aware filter? Or is my only option to just suck it into a buffer and .replaceAll() the crap out of it?
DanInDC
I'm sure I've seen example FilterReader code somewhere that filters by regular expressions. I can't put my hands on it at the moment, but a Google search might find something. It does basically amount to "suck it into a buffer and .replaceAll() the crap out of it", but within the filter code.
Don Roby
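For reference, the "buffer it and replaceAll() inside the filter" idea from the comments might look roughly like the sketch below. This is not the code Don Roby was remembering; the class name BufferingEntityFilterReader and the BAD_REF pattern are assumptions made for illustration. It extends Reader rather than FilterReader so that skip() and friends cannot bypass the cleaned copy, and the pattern only covers references to control characters below 0x20 (other than tab, LF and CR), not every invalid code point.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BufferingEntityFilterReader extends Reader {

    // Hex or decimal references to 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
    private static final String BAD_REF =
            "&#(?:[xX]0*(?:[0-8bBcCeEfF]|1[0-9a-fA-F])"
            + "|0*(?:[0-8]|1[124-9]|2[0-9]|3[01]));";

    private final Reader source;
    private Reader cleaned; // built lazily on the first read

    BufferingEntityFilterReader(Reader source) {
        this.source = source;
    }

    /** The replaceAll() step on its own, for callers that already hold a String. */
    static String stripBadReferences(String xml) {
        return xml.replaceAll(BAD_REF, "");
    }

    // Drains the wrapped reader once, cleans the text, and serves it from memory.
    private Reader cleaned() throws IOException {
        if (cleaned == null) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = source.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            cleaned = new StringReader(stripBadReferences(sb.toString()));
        }
        return cleaned;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        return cleaned().read(buf, off, len);
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}
```

Since Unmarshaller.unmarshal accepts a Reader, something like new BufferingEntityFilterReader(new InputStreamReader(is, "UTF-8")) could stand in for the InvalidXMLCharacterFilterReader in the question. The trade-off is the one already named: the whole document sits in memory before parsing starts.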