I have to handle this scenario in Java:

I receive a request as XML from a client, with encoding=utf-8 declared. Unfortunately, it may contain characters that are not valid UTF-8, and there is a requirement to remove these characters from the XML on my side (legacy).

Let's consider an example where this invalid XML contains £ (pound).

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I will probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character? Are there any potential issues?

2) I get the XML as an array of bytes. How do I handle this operation safely in that case?

+1  A: 

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I will probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character?

Since you mention a "legacy" side, I assume you actually want to get rid of non-ASCII characters. You can remove anything outside the printable ASCII range using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", "");
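
To make the effect of that regex concrete, here is a minimal sketch. One thing to be aware of: the range \x20-\x7e also excludes tabs and line breaks, so they are stripped along with £. The class name, method names, and the whitespace-preserving variant are illustrative assumptions, not part of the answer above.

```java
final class AsciiFilterDemo {

    // Same regex as above: drops every char outside printable ASCII (0x20-0x7E).
    // This includes tab, LF and CR, so line structure is lost.
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    // Assumed variant (not from the answer): additionally keeps tab, LF and CR.
    static String stripKeepingWhitespace(String s) {
        return s.replaceAll("[^\\x20-\\x7e\\t\\n\\r]", "");
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign; the input spans two lines.
        String xml = "<price>\u00A3 10</price>\n<note>ok</note>";
        System.out.println(stripNonPrintableAscii(xml));
        System.out.println(stripKeepingWhitespace(xml));
    }
}
```

With the first method the two lines of the sample are joined into one; the second keeps the line break and only drops the pound sign.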

2) I get the XML as an array of bytes. How do I handle this operation safely in that case?

You need to wrap the byte[] in a ByteArrayInputStream, so that you can read it as a UTF-8 encoded character stream using an InputStreamReader in which you specify the encoding, and then use a BufferedReader to read it line by line.

E.g.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
    for (String line; (line = reader.readLine()) != null;) {
        line = line.replaceAll("[^\\x20-\\x7e]", "");
        // ...
    }
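    // ...
} finally {
    // Close the reader even if reading fails.
    if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}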
BalusC
+2  A: 

UTF-8 is an encoding; Unicode is a character set. But the GBP symbol is most definitely in the Unicode character set and therefore most certainly representable in UTF-8.

If you do in fact mean UTF-8, and you are actually trying to remove byte sequences that are not the valid encoding of a character in UTF-8, then...

CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ...;
CharBuffer parsed = utf8Decoder.decode(bytes);
...
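
As a self-contained sketch of the same idea, the following wraps a byte[] in a ByteBuffer and decodes it with both error actions set to IGNORE. The class and method names and the sample bytes are assumptions made for illustration; StandardCharsets requires Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Sanitizer {

    // Decodes raw bytes as UTF-8, silently dropping byte sequences
    // that are not a valid UTF-8 encoding of any character.
    static String decodeIgnoringInvalid(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xA3 is the pound sign in ISO-8859-1, but on its own it is
        // an invalid (stray continuation) byte in UTF-8.
        byte[] latin1Pound = { 'a', (byte) 0xA3, 'b' };
        System.out.println(decodeIgnoringInvalid(latin1Pound));

        // 0xC2 0xA3 is the valid UTF-8 encoding of the pound sign, so it is kept.
        byte[] utf8Pound = { 'a', (byte) 0xC2, (byte) 0xA3, 'b' };
        System.out.println(decodeIgnoringInvalid(utf8Pound));
    }
}
```

The second case illustrates the point made above: a correctly encoded £ is valid UTF-8 and is not removed by this approach.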
Sean Owen
A: 

For some reason I cannot add a comment, but as I wrote, I meant non-UTF-8, not non-ASCII. I have 2 non-Unicode characters to remove or replace (I don't have them in front of me and cannot check right now). The pound example was confusing; it is valid UTF-8 (I hadn't checked it).

@Sean Owen.

Thanks for the example. Based on that I could get rid of the unwanted characters, but I'm wondering how to do it when I get a String and decide to replace:

first non-utf8 -> A
second non-utf8 -> B

Is it feasible (replaceAll?)?

St Nietzke
Rather than 'answer' your own question, you can modify the text of your original question. I'd delete this. There is still some confusion here: it sounded like you wanted to ignore invalid byte sequences in the input, that is, bytes which do not encode anything at all in UTF-8. That's what my code does. If it parses correctly, then the character was correctly encoded in UTF-8 and is representable in Unicode. So you're not removing "non UTF-8" characters. You simply want to remove certain characters. Clarify what you need.
Sean Owen
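
If the goal turns out to be what the comment above describes, replacing certain known characters in an already-decoded String, a minimal sketch could use String.replace(char, char), which needs no regex. The two target characters below (U+00A3 and U+00A7) are hypothetical placeholders, since the thread never says which characters are actually meant; the class and method names are likewise assumptions.

```java
final class CharReplaceDemo {

    // Hypothetical placeholders for the two unwanted characters.
    static final char FIRST_UNWANTED = '\u00A3';
    static final char SECOND_UNWANTED = '\u00A7';

    // Replaces the first unwanted character with 'A' and the second with 'B'.
    static String replaceUnwanted(String s) {
        return s.replace(FIRST_UNWANTED, 'A')
                .replace(SECOND_UNWANTED, 'B');
    }

    public static void main(String[] args) {
        System.out.println(replaceUnwanted("<v>\u00A3 and \u00A7</v>"));
    }
}
```

This only works after the bytes have been decoded with the right charset; bytes that were invalid UTF-8 in the first place never become identifiable characters in the String.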