A: 

Try:

def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

ULysses
This changes the pattern so that it will NOT match `\u0000`, when matching it is clearly the intent.
polygenelubricants
A: 
line 23:26: unexpected char: 0x0

This error message points to this part of the code:

def illegalChars = ~/[\u0000-...
12345678901234567890123

It looks like, for some reason, the compiler doesn't accept the Unicode character 0 in the source code. That said, you should be able to fix this by doubling the backslash. This prevents the Unicode escapes from being processed at the source code level and lets the regex engine handle them instead:

def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/

Note that I've also combined the character classes into one instead of using alternation, and written \u000B and \u000C as individual members, since a range isn't necessary for two adjacent characters.


On doubling the backslash

Here's the relevant quote from the java.util.regex.Pattern documentation:

Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

To illustrate, in Java:

System.out.println("\n".matches("\\u000A")); // prints "true"

However:

System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is because \u000A, which is the newline character, is escaped in the second snippet at the source code level. The source code essentially becomes:

System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is not legal Java source code.
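
For an escape that is harmless at the source level, the "not equal, but same pattern" point from the quote can be checked directly. This is only a minimal sketch; the class name UnicodeEscapeDemo and the helper matchesDash are made up for illustration and are not from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class UnicodeEscapeDemo {
    // The compiler decodes this escape, so the string holds the single
    // character U+2014 before the regex engine ever sees it.
    static final String SOURCE_LEVEL = "\u2014";

    // The doubled backslash keeps the escape intact: the string holds a
    // backslash, 'u', and four digits, and the regex parser decodes it.
    static final String REGEX_LEVEL = "\\u2014";

    // True if the given regex matches a string holding only U+2014.
    static boolean matchesDash(String regex) {
        return Pattern.matches(regex, "\u2014");
    }

    public static void main(String[] args) {
        System.out.println(SOURCE_LEVEL.length());            // 1
        System.out.println(REGEX_LEVEL.length());             // 6
        System.out.println(SOURCE_LEVEL.equals(REGEX_LEVEL)); // false
        System.out.println(matchesDash(SOURCE_LEVEL));        // true
        System.out.println(matchesDash(REGEX_LEVEL));         // true
    }
}
```

The two strings differ in length and content, yet both compile to a pattern that matches U+2014, which is what the documentation describes.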

polygenelubricants
Thank you -- the code and the explanation explained a lot! And it does compile, so it seems like I'm in good shape. Thanks!
Rakesh Malik
@Rakesh: If this answer solved your problem, please mark it as accepted by clicking on the check-mark icon.
Alan Moore
Oops -- newbie mistake... done :) Should I move the 2nd part to a separate question?
Rakesh Malik
@Rakesh: no need for a new question just yet, we can try to figure it out on this question for now. You can unaccept for now so people can see that you still have issues. If you can provide a complete snippet that reproduces the problem, that'd also be nice. @Alan: do you have insight on Rakesh's follow-up?
polygenelubricants
I'll put up the entire snippet momentarily. I was doing a bit of cleanup so that it would be easier to read, as well as doing some experimenting to try to solve this issue, as opposed to just waiting for the community to solve it for me :) @polygenelubricants -- I'm wondering about the slashes now also, because I tried them in regexpal and it selected every non-alphabetical character -- except the one that I'm trying to strip.
Rakesh Malik
Is there a way to put raw XML (well, in this case SGML, but it ought to work the same way) text up? I have an SGML snippet with illegal characters, but it doesn't look that way after publishing.
Rakesh Malik
@Alan, @Rakesh: I've taken the initiative to splinter this issue to a different question: http://stackoverflow.com/questions/3241933/how-to-use-unicode-escapes-in-groovys-pattern-syntax
polygenelubricants
@Rakesh: for now, try using `"pattern"` instead of `/pattern/`. So `"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"`. Tell me if that works.
polygenelubricants
@polygenelubricants -- I'll follow it on the other thread, and check this one again, since the first question DID get answered here :) I'll also try the regular quotes, and see what happens. Thanks!
Rakesh Malik
@Rakesh: Also try with `\0` instead of `\u0000`, no double slashing.
polygenelubricants
Switching to a double-quoted string ended up doing the trick, so the script is working now. Thanks for all the help!
Rakesh Malik
A: 

OK, here are my findings:

>>> print "XYZ".replaceAll(
       /[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
       "-"
    )

---

>>> print "X\0YZ".replaceAll(
       /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
       "-"
    )

X-YZ

>>> print "X\0YZ".replaceAll(
       "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
       "-"
    )

X-YZ

In other words, my \\uNNNN answer within /pattern/ is WRONG. Inside /pattern/ both backslashes reach the regex engine, which reads \\ as a literal backslash. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.

The \\uNNNN only works in "pattern", not in /pattern/.

I will edit my official answer based on comments to this "answer".
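
The same behaviour can be reproduced in plain Java by handing the regex engine the text that each Groovy literal produces. This is a sketch under that assumption (that /pattern/ passes both backslashes through, as the output above suggests); the class name SlashyEscapeDemo and the helper strip are hypothetical.

```java
// Hypothetical demo class; the names here are illustrative only.
class SlashyEscapeDemo {
    // Text the regex engine receives from the slashy string with doubled
    // backslashes: each pair survives, so Java source needs four per pair.
    static final String DOUBLED =
        "[\\\\u0000-\\\\u0008\\\\u000B\\\\u000C\\\\u000E-\\\\u001F\\\\u007F-\\\\u009F]";

    // Text the regex engine receives from the double-quoted string:
    // one backslash per escape, decoded by the regex parser itself.
    static final String ESCAPED =
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]";

    // Replace every match of the regex with a dash.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "-");
    }

    public static void main(String[] args) {
        // The literal backslash turns "0-\" into a range covering '0'..'\'.
        System.out.println(strip(DOUBLED, "XYZ"));     // ---
        System.out.println(strip(DOUBLED, "<a>"));     // -a-
        // The intended class only touches the control character.
        System.out.println(strip(ESCAPED, "X\0YZ"));   // X-YZ
    }
}
```

With the doubled form the class never contains the control characters at all; it contains a backslash, the letter u, some digits, and a few accidental ranges ending at the backslash.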

polygenelubricants
Changing the type of string worked, so my script is now working as it should. Thanks!
Rakesh Malik