First things first, I can not change the output of the xml, it is being produced by a third party. They are inserting invalid characters in the the xml. I am given a InputStream of the byte stream representation of the xml. Is their a cleaner way to filter out the offending characters besides consuming the stream into a String and processing it? I found this: using a FilterReader but that doesn't work for me as I have a byte stream and not a character stream.
For what it's worth this is all part of a jaxb unmarshalling procedure, just in case that offers options.
We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.
Here is a FilterReader I tried to build.
public class InvalidXMLCharacterFilterReader extends FilterReader
{
private static final Log LOG = LogFactory
.getLog(InvalidXMLCharacterFilterReader.class);
public InvalidXMLCharacterFilterReader(Reader in)
{
super(in);
}
public int read() throws IOException {
char[] buf = new char[1];
int result = read(buf, 0, 1);
if (result == -1)
return -1;
else
return (int) buf[0];
}
public int read(char[] buf, int from, int len) throws IOException {
int count = 0;
while (count == 0) {
count = in.read(buf, from, len);
if (count == -1)
return -1;
int last = from;
for (int i = from; i < from + count; i++) {
LOG.debug("" + (char)buf[i]);
if(!isBadXMLChar(buf[i]))
{
buf[last++] = buf[i];
}
}
count = last - from;
}
return count;
}
private boolean isBadXMLChar(char c)
{
if ((c == 0x9) ||
(c == 0xA) ||
(c == 0xD) ||
((c >= 0x20) && (c <= 0xD7FF)) ||
((c >= 0xE000) && (c <= 0xFFFD)) ||
((c >= 0x10000) && (c <= 0x10FFFF)))
{
return false;
}
return true;
}
}
And here is how I am unmarshalling it:
jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);
and some example bad xml
<?xml version="1.0" encoding="UTF-8" ?>
<foo>
bar
</foo>