ansaurus

Question

Hard time with escape character

Answer 1

+5 A:

hello \u003c here < and > there

The \u003c in an ASCII file won't do it; you need to put the actual Unicode character in a Unicode-encoded text file.

BioBuckyBall 2010-09-14 18:02:54

Answer 2

A:

It looks to me like the problem isn't with your escaping, but with the fact that you're trying to parse Unicode data.

Have you tried using the two-argument version of readFileToString, replacing your readFileToString(File) call with readFileToString(File, Encoding)?
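As a rough illustration of reading with an explicit encoding, here is a sketch that uses only the standard library rather than FileUtils. The class name ReadWithCharset, the helpers readUtf8 and roundTrip, and the choice of UTF-8 are assumptions made for the example, not something taken from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class ReadWithCharset {

    // Decode the file's bytes with an explicit charset (UTF-8 is assumed here)
    // instead of relying on the platform default.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Write the text to a temporary file as UTF-8 and read it back.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("escape-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("a < b > c"));
    }
}
```

Note that reading with the right charset only decodes real characters correctly. It does not turn the six characters of an escape sequence written in the file into a single <.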

Resources

FileUtils

zigdon 2010-09-14 18:03:09

Answer 3

+2 A:

When you compile your source file, the very first thing that happens--before any lexing or parsing--is that the Unicode escapes, \u003C and \u003E, get converted to the actual characters, < and >. So your code is really:

return data.replaceAll("(?i)[<|>|<|>]", "");

When you compile the code for the test against the string literal, the same thing happens; the test string that you wrote as:

"a < b > c\u003e\u003E\u003c\u003C"

...is really:

"a < b > c>><<"
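One way to convince yourself of this is a small check like the following sketch. The class name EscapeDemo and the constant LITERAL are made up for the example.

```java
// Hypothetical demo class, for illustration only.
final class EscapeDemo {

    // The compiler translates the four escapes before it parses the string literal.
    static final String LITERAL = "a < b > c\u003e\u003E\u003c\u003C";

    public static void main(String[] args) {
        System.out.println(LITERAL);                          // a < b > c>><<
        System.out.println(LITERAL.equals("a < b > c>><<"));  // true
        System.out.println(LITERAL.length());                 // 13
    }
}
```

The length is 13, not 33, because each escape has already become a single character by the time the program runs.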

But when you read the test string from a file, no such conversion occurs; you end up trying to match the six-character sequence \u003c with the single character, <. If you really want to match \u003C and \u003E, your code should look like this:

return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");

If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.
If you use two backslashes, the regex compiler interprets it as a Unicode escape and thinks you want to match a < or >.
If you use three backslashes, the Java compiler turns it into \< or \>, the regex compiler ignores the backslash, and it tries to match < or >.
So, to match a raw Unicode escape sequence, you have to use four backslashes to match the one backslash in the escape sequence.
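The backslash cases above can be checked with a short sketch like this one. BackslashDemo, RAW and matches are hypothetical names; RAW stands in for text as it would arrive from a file.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class BackslashDemo {

    // Text as it would arrive from a file: a backslash, 'u', and four hex digits.
    static final String RAW = "\\u003c";

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(RAW.length());                // 6

        // Two backslashes: the regex engine sees its own Unicode escape and matches '<'.
        System.out.println(matches("\\u003c", "<"));     // true
        System.out.println(matches("\\u003c", RAW));     // false

        // Three backslashes: the compiler leaves an escaped '<' for the regex engine.
        System.out.println(matches("\\\u003c", "<"));    // true

        // Four backslashes: an escaped backslash followed by the plain text u003c.
        System.out.println(matches("\\\\u003c", RAW));   // true
        System.out.println(matches("\\\\u003c", "<"));   // false
    }
}
```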

Notice that I changed your brackets, too. [<|>] is a character class that matches <, | or >; what you want is an alternation.
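To see the difference between the two forms, consider this sketch. BracketDemo and its two methods are hypothetical names, and the sample input is invented.

```java
// Hypothetical demo class, for illustration only.
final class BracketDemo {

    // Character class: removes '<', '|' and '>' alike.
    static String stripWithClass(String s) {
        return s.replaceAll("[<|>]", "");
    }

    // Alternation: removes only '<' and '>'.
    static String stripWithAlternation(String s) {
        return s.replaceAll("<|>", "");
    }

    public static void main(String[] args) {
        String input = "a|b <c>";
        System.out.println(stripWithClass(input));        // ab c
        System.out.println(stripWithAlternation(input));  // a|b c
    }
}
```

For single characters, a class without the pipes, [<>], behaves the same as the alternation. The alternation is needed here because the escape-sequence alternatives are longer than one character.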

Alan Moore 2010-09-14 18:54:20

Thanks for everything: the explanation, catching my mistake with the brackets, and providing the fix I was looking for.

double07 2010-09-15 15:59:05
