ansaurus

Question

Groovy Regex problem

Answer 1

A:

Try:

def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

ULysses 2010-07-13 18:49:07

This will change the pattern so that it will NOT match `\u0000`, when that is clearly the intent.

polygenelubricants 2010-07-13 19:27:27

Answer 2

+2 A:

line 23:26: unexpected char: 0x0

This error message points to this part of the code:

def illegalChars = ~/[\u0000-...
12345678901234567890123

It looks like, for some reason, the compiler doesn't accept the Unicode character 0 in the source code. That said, you should be able to fix this by doubling the backslash. This prevents the Unicode escape from being processed at the source-code level and lets the regex engine handle it instead:

def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/

Note that I've also combined the character classes into one instead of using alternation, and dropped the range notation where it isn't needed (\u000B-\u000C covers only two adjacent characters, so they are simply listed).
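As a quick sanity check of that rewrite, here is a minimal Java sketch that compares the two spellings one character at a time. The class and method names are made up for illustration, and the alternated form is assumed to start at \u0000 as in the question. Both patterns are written with a doubled backslash inside a Java string, so the regex engine resolves the escapes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class CombinedClassDemo {
    // Four separate classes joined by alternation (assumed original form).
    static final Pattern ALTERNATED = Pattern.compile(
        "[\\u0000-\\u0008]|[\\u000B-\\u000C]|[\\u000E-\\u001F]|[\\u007F-\\u009F]");

    // The same set written as a single character class.
    static final Pattern COMBINED = Pattern.compile(
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]");

    // Returns true if both patterns accept exactly the same single chars.
    static boolean agreeOnEveryChar() {
        for (int c = 0; c <= 0xFFFF; c++) {
            String s = String.valueOf((char) c);
            boolean a = ALTERNATED.matcher(s).matches();
            boolean b = COMBINED.matcher(s).matches();
            if (a != b) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnEveryChar()); // prints "true"
    }
}
```

The single class is mostly a readability choice; it describes the same set of characters as the alternation.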

References

regular-expressions.info/Character Classes

On doubling the backslash

Here's the relevant quote from java.util.regex.Pattern:

Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
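The "not equal, but compile into the same pattern" point can be checked directly. The following is a small Java sketch with hypothetical class and field names; it only uses String.length, String.equals and Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the quoted passage.
class EscapeLevelsDemo {
    // The Java compiler resolves this escape: the string holds one character.
    static final String COMPILER_LEVEL = "\u2014";

    // The compiler sees an escaped backslash: the string holds six
    // characters, and the regex parser resolves the escape later.
    static final String REGEX_LEVEL = "\\u2014";

    // Built from the code point so the test subject contains no escape.
    static final String DASH = String.valueOf((char) 0x2014);

    // True if both spellings, used as regexes, match the dash character.
    static boolean bothMatchDash() {
        return Pattern.matches(COMPILER_LEVEL, DASH)
            && Pattern.matches(REGEX_LEVEL, DASH);
    }

    public static void main(String[] args) {
        System.out.println(COMPILER_LEVEL.length());            // prints "1"
        System.out.println(REGEX_LEVEL.length());               // prints "6"
        System.out.println(COMPILER_LEVEL.equals(REGEX_LEVEL)); // prints "false"
        System.out.println(bothMatchDash());                    // prints "true"
    }
}
```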

To illustrate, in Java:

System.out.println("\n".matches("\\u000A")); // prints "true"

However:

System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is because \u000A, the line feed character, is translated in the second snippet at the source-code level, before the string literal is parsed. The source code essentially becomes:

System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is not legal Java source code.

polygenelubricants 2010-07-13 18:57:57

Thank you -- the code and the explanation explained a lot! And it does compile, so it seems like I'm in good shape. Thanks!

Rakesh Malik 2010-07-13 20:16:24

@Rakesh: If this answer solved your problem, please mark it as accepted by clicking on the check-mark icon.

Alan Moore 2010-07-13 21:13:10

Oops -- newbie mistake... done :) Should I move the 2nd part to a separate question?

Rakesh Malik 2010-07-13 21:18:45

@Rakesh: no need for a new question just yet, we can try to figure it out on this question for now. You can unaccept for now so people can see that you still have issues. If you can provide a complete snippet that reproduces the problem, that'd also be nice. @Alan: do you have insight on Rakesh's follow-up?

polygenelubricants 2010-07-13 21:21:20

I'll put up the entire snippet momentarily. I was doing a bit of cleanup so that it would be easier to read, as well as doing some experimenting to try to solve this issue, as opposed to just waiting for the community to solve it for me :) @polygenelubricants -- I'm wondering about the slashes now also, because I tried them in regexpal and it selected every non-alphabetical character -- except the one that I'm trying to strip.

Rakesh Malik 2010-07-13 21:41:11

Is there a way to put raw XML (well, in this case SGML, but it ought to work the same way) text up? I have an SGML snippet with illegal characters, but it doesn't look that way after publishing.

Rakesh Malik 2010-07-13 21:50:54

@Alan, @Rakesh: I've taken the initiative to splinter this issue to a different question: http://stackoverflow.com/questions/3241933/how-to-use-unicode-escapes-in-groovys-pattern-syntax

polygenelubricants 2010-07-13 22:07:11

@Rakesh: for now, try using `"pattern"` instead of `/pattern/`. So `"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"`. Tell me if that works.

polygenelubricants 2010-07-13 22:14:17

@polygenelubricants -- I'll follow it on the other thread, and check this one again, since the first question DID get answered here :) I'll also try the regular quotes, and see what happens. Thanks!

Rakesh Malik 2010-07-13 22:17:54

@Rakesh: Also try with `\0` instead of `\u0000`, no double slashing.

polygenelubricants 2010-07-13 22:22:47

Switching to a double-quoted string ended up doing the trick, so the script is working now. Thanks for all the help!

Rakesh Malik 2010-07-15 15:27:33

Answer 3

A:

OK, here's my finding:

>>> print "XYZ".replaceAll(
       /[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
       "-"
    )

---

>>> print "X\0YZ".replaceAll(
       /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
       "-"
    )

X-YZ

>>> print "X\0YZ".replaceAll(
       "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
       "-"
    )

X-YZ

In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes a range inside the character class, and that range includes <, > and all capital letters.

The \\uNNNN only works in "pattern", not in /pattern/.
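The same effect can be reproduced in plain Java by handing the regex engine the text each Groovy form is assumed to produce: two backslashes before each u for the slashy form, one for the double-quoted form. This is only a sketch under that assumption, and the class, field and method names are hypothetical.

```java
// Hypothetical demo class; the pattern texts are assumptions about what
// Groovy passes to the regex engine for each string form.
class SlashyPitfallDemo {
    // Doubled backslash in a slashy string: both backslashes survive, so
    // the regex sees a literal backslash followed by plain letters/digits.
    static final String FROM_SLASHY =
        "[\\\\u0000-\\\\u0008\\\\u000B\\\\u000C\\\\u000E-\\\\u001F\\\\u007F-\\\\u009F]";

    // Doubled backslash in a double-quoted string: one backslash survives,
    // so the regex parser resolves each escape itself.
    static final String FROM_QUOTED =
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]";

    // Replaces every match of the given pattern text with a hyphen.
    static String strip(String patternText, String input) {
        return input.replaceAll(patternText, "-");
    }

    public static void main(String[] args) {
        System.out.println(strip(FROM_SLASHY, "XYZ")); // prints "---"
        System.out.println(strip(FROM_QUOTED, "XYZ")); // prints "XYZ"
        System.out.println(strip(FROM_SLASHY, "<a>")); // prints "-a-"
        // The NUL character is replaced only by the double-quoted form.
        System.out.println(strip(FROM_QUOTED, "X\0YZ").equals("X-YZ")); // prints "true"
    }
}
```

With the slashy text, the class contains the range from 0 to the backslash, which is why capital letters and angle brackets are replaced while the NUL character is left alone.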

I will edit my official answer based on comments to this "answer".
