I need to strip a few invalid characters out of a string, so I wrote the following code as part of a StringUtil library:

public static String removeBlockedCharacters(String data) {
    if (data==null) {
      return data;
    }
    return data.replaceAll("(?i)[<|>|\u003C|\u003E]", "");
}

I have a test file, illegalCharacters.txt, with one line in it:

hello \u003c here < and > there

I run the following unit test:

@Test
public void testBlockedCharactersRemoval() throws IOException{
    checkEquals(StringUtil.removeBlockedCharacters("a < b > c\u003e\u003E\u003c\u003C"), "a  b  c");
    log.info("Procesing from string directly: " + StringUtil.removeBlockedCharacters("hello \u003c here < and > there"));
    log.info("Procesing from file to string:  " + StringUtil.removeBlockedCharacters(FileUtils.readFileToString(new File("src/test/resources/illegalCharacters.txt"))));
}

I get:

INFO - 2010-09-14 13:37:36,111 - TestStringUtil.testBlockedCharactersRemoval(36) | Procesing from string directly: hello  here  and  there
INFO - 2010-09-14 13:37:36,126 - TestStringUtil.testBlockedCharactersRemoval(37) | Procesing from file to string:  hello \u003c here  and  there

I am VERY confused. As you can see, the code properly strips out '<', '>', and '\u003c' when I pass a string literal containing these values, but it fails to strip out '\u003c' when I read the same string from a file.

My questions, so that I stop losing hair over it, are:

  1. Why do I get this behavior?
  2. How can I change my code to properly strip \u003c in all cases?

Thanks

+5  A: 

hello \u003c here < and > there

The \u003c in an ASCII file won't do it; you need to put the actual Unicode character in a Unicode-encoded text file.

BioBuckyBall
A: 

It looks to me like the problem isn't with your escaping, but with the fact that you have Unicode data you're trying to parse.

Have you tried using the two-argument version of readFileToString, replacing your readFileToString(File) call with readFileToString(File, Encoding)?

zigdon
+2  A: 

When you compile your source file, the very first thing that happens--before any lexing or parsing--is that the Unicode escapes, \u003C and \u003E, get converted to the actual characters, < and >. So your code is really:

return data.replaceAll("(?i)[<|>|<|>]", "");

When you compile the code for the test against the string literal, the same thing happens; the test string that you wrote as:

"a < b > c\u003e\u003E\u003c\u003C"

...is really:

"a < b > c>><<"

But when you read the test string from a file, no such conversion occurs; you end up trying to match the six-character sequence \u003c with the single character, <. If you really want to match \u003C and \u003E, your code should look like this:

return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");

  • If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.

  • If you use two backslashes, the regex compiler interprets it as a Unicode escape and thinks you want to match a < or >.

  • If you use three backslashes, the Java compiler turns it into \< or \>, the regex compiler ignores the backslash, and it tries to match < or >.

  • So, to match a raw Unicode escape sequence, you have to use four backslashes to match the one backslash in the escape sequence.
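
The four cases in the list above can be checked directly. This is only a minimal sketch: the class name BackslashCountDemo and its helper methods are made up for illustration, and RAW stands in for the six-character text that would come out of the file.

```java
import java.util.regex.Pattern;

public class BackslashCountDemo {
    // What a text file holds: six characters (backslash, u, 0, 0, 3, C).
    static final String RAW = "\\u003C";

    // The four spellings from the list above, as they would appear in source.
    static final String ONE = "\u003C";       // compiler turns this into "<"
    static final String TWO = "\\u003C";      // regex engine reads a Unicode escape
    static final String THREE = "\\\u003C";   // string is a backslash followed by "<"
    static final String FOUR = "\\\\u003C";   // regex: literal backslash, then u003C

    // True if the regex matches the raw six-character sequence.
    static boolean matchesRaw(String regex) {
        return Pattern.matches(regex, RAW);
    }

    // True if the regex matches the single character "<".
    static boolean matchesLessThan(String regex) {
        return Pattern.matches(regex, "<");
    }

    public static void main(String[] args) {
        for (String regex : new String[] {ONE, TWO, THREE, FOUR}) {
            System.out.println(regex + " matches '<': " + matchesLessThan(regex)
                    + ", matches raw text: " + matchesRaw(regex));
        }
    }
}
```

Only the four-backslash version should report a match against the raw text; the other three should match the single character < instead.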

Notice that I changed your brackets, too. [<|>] is a character class that matches <, | or >; what you want is an alternation.
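
For reference, here is a self-contained sketch of the corrected method with a quick check. The class name BlockedCharacters is hypothetical, and the doubled backslash in the test string simulates the raw text that would be read from the file, rather than reading an actual file.

```java
public class BlockedCharacters {
    // Four backslashes in the source become two in the string, which the
    // regex engine reads as one literal backslash. (?i) makes the match
    // case-insensitive, so both upper- and lowercase hex digits are covered.
    private static final String BLOCKED = "(?i)(?:<|>|\\\\u003C|\\\\u003E)";

    public static String removeBlockedCharacters(String data) {
        if (data == null) {
            return null;
        }
        return data.replaceAll(BLOCKED, "");
    }

    public static void main(String[] args) {
        // Double backslash: the string holds the raw six-character sequence,
        // as it would after being read from a text file.
        String fromFile = "hello \\u003c here < and > there";
        System.out.println(removeBlockedCharacters(fromFile));
        // prints: hello  here  and  there
    }
}
```

Because the pattern is an alternation rather than a character class, a literal | in the input is left alone.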

Alan Moore
Thanks for everything: the explanation, catching my mistake about the brackets, and providing the fix I was looking for.
double07