+1  Q: 

Regex gives error

Continuing with the post at http://stackoverflow.com/questions/705672/regular-expression-to-allow-a-set-of-characters-and-disallow-others/705990#705990

Does anybody know why the following happens?

I get the error below when I create a regular expression such as:

[^@*–’”“\r\nœçsÇSgGšcrŠRNEŽDTCnežuUIti—¿„”]+

and enter any of these restricted characters in the input field

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.regexp.RECompiler$RERange.delete(RECompiler.java:1326)
    at org.apache.regexp.RECompiler$RERange.remove(RECompiler.java:1417)
    at org.apache.regexp.RECompiler$RERange.include(RECompiler.java:1459)
    at org.apache.regexp.RECompiler$RERange.include(RECompiler.java:1470)
    at org.apache.regexp.RECompiler.characterClass(RECompiler.java:699)
    at org.apache.regexp.RECompiler.terminal(RECompiler.java:863)
    at org.apache.regexp.RECompiler.closure(RECompiler.java:942)
    at org.apache.regexp.RECompiler.branch(RECompiler.java:1151)
    at org.apache.regexp.RECompiler.expr(RECompiler.java:1203)
    at org.apache.regexp.RECompiler.compile(RECompiler.java:1281)
    at org.apache.regexp.RE.<init>(RE.java:495)
    at org.apache.regexp.RE.<init>(RE.java:480)

but this expression works perfectly fine

[^@*–’”“\r\nœçsÇSgGšcrŠRN]+

Also,

[^@*–’”“\r\nœçsÇSgGšcrŠR„”]+

works but

[^@*–’”“\r\nœçsÇSgGšcrŠRNE]+

does not work and gives the above error.

Is there a limit to the number of characters that can be disallowed like the way above?

Regards, Udit Sud

A: 

Looks like a bug in the Apache regexp parser. Can you use the standard one (java.util.regex)?
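
A minimal sketch of what that switch could look like, assuming the shorter failing pattern from the question and spelling its non-ASCII characters as Unicode escapes. The class name `DisallowedCharsCheck` and the helper `isAllowed` are made up for illustration, and the exact characters are taken from the post as displayed, so they may differ from what the original application used.

```java
import java.util.regex.Pattern;

public class DisallowedCharsCheck {
    // Pattern from the question that failed under org.apache.regexp.
    // Non-ASCII characters are written as Unicode escapes so the result
    // does not depend on the source file encoding.
    private static final Pattern ALLOWED = Pattern.compile(
        "[^@*\u2013\u2019\u201D\u201C\\r\\n\u0153\u00E7s\u00C7SgG\u0161cr\u0160RNE]+");

    // True only when the whole input consists of permitted characters.
    // An empty input is rejected because of the + quantifier.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("hello 123"));    // no restricted characters
        System.out.println(isAllowed("user@example")); // contains '@', 's' and 'r'
    }
}
```

`matches()` requires the entire input to match, which is usually what you want when validating a form field against a negated character class.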

Vanger
Can only the author of the question mark this post as the "right answer"? Too bad he didn't do this. :(
Vanger
A: 

I'm not a big Regex man myself, but here are 3 regex testing sites that might help:

http://www.txt2re.com/index.php3
http://gskinner.com/RegExr/
http://regex.larsolavtorvik.com/

WebDevHobo
+3  A: 

The dash (minus sign) has special meaning in character classes. It defines ranges of consecutive characters, like "a-z".

There may exist a consecutive range for "*–’", but I guess this is not your intention. You probably wanted the literal dash, and I suspect the exception you are seeing has something to do with this.

Instead of this:

[^@*–’”“\r\nœçsÇSgGšcrŠRNEŽDTCnežuUIti—¿„”]+
----^ (this is the error)

Try:

[^@*’”“\r\nœçsÇSgGšcrŠRNEŽDTCnežuUIti—¿„”–]+
-----------------------------------------^ (this is okay)

or

[^-@*’”“\r\nœçsÇSgGšcrŠRNEŽDTCnežuUIti—¿„”]+
--^ (this is okay as well; the dash must come after the negating ^, or the class is no longer negated)

or

[^@*\–’”“\r\nœçsÇSgGšcrŠRNEŽDTCnežuUIti—¿„”]+
----^^ (this is okay as well)
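
To illustrate the general rule about the ASCII hyphen in character classes, here is a small sketch using java.util.regex (not org.apache.regexp, whose behaviour may differ). The class name `HyphenInClass` and its `matches` helper are just for the example.

```java
import java.util.regex.Pattern;

public class HyphenInClass {
    // Thin wrapper so the calls below read compactly.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Between two characters the hyphen defines a range: a, b or c.
        System.out.println(matches("[a-c]", "b"));    // true
        System.out.println(matches("[a-c]", "-"));    // false

        // At the end of the class the hyphen is a literal.
        System.out.println(matches("[ac-]", "-"));    // true
        System.out.println(matches("[ac-]", "b"));    // false

        // Escaped, it is a literal anywhere in the class.
        System.out.println(matches("[a\\-c]", "-"));  // true

        // Right after the negating ^ it is also a literal.
        System.out.println(matches("[^-ac]", "-"));   // false
        System.out.println(matches("[^-ac]", "b"));   // true
    }
}
```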
Tomalak
+1. Now I do wonder, if it doesn't give any error with the .Net engine, does it behave as expected?!
Lieven
I guess that "*–’" in fact *is* a valid range, but something about it makes the RECompiler barf. Maybe I'm wrong, but the dash is quite suspicious.
Tomalak
My guess is that it trips up on including more of the same character in one character class. Could /[aa]/ crash it?
strager
Unless someone actually looks at RECompiler.java, we can only guess. :-) In theory, [aa] is fine and should not cause an error in any implementation.
Tomalak
That's not a hyphen you're pointing at, it's an en-dash (U+2013), and the other one is an em-dash (U+2014). Neither has any special meaning. The error is due to the bug Eddie pointed out, and not anything the OP is doing.
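
If in doubt about which dash a pattern contains, something like the following can be used to check. This is only a sketch with java.util.regex and the standard `Character.getName`; the class name `DashCodePoints` and the helper `describe` are made up for illustration.

```java
import java.util.regex.Pattern;

public class DashCodePoints {
    // Lists each code point of the input as "U+XXXX NAME", one per line.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(
            String.format("U+%04X %s%n", cp, Character.getName(cp))));
        return out.toString();
    }

    public static void main(String[] args) {
        // ASCII hyphen-minus, en dash, em dash
        System.out.print(describe("-\u2013\u2014"));

        // An en dash inside a class is an ordinary member, not a range operator.
        System.out.println(Pattern.matches("[a\u2013c]", "b"));      // false
        System.out.println(Pattern.matches("[a\u2013c]", "\u2013")); // true
    }
}
```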
Alan Moore
@Alan: It looked like a hyphen to me, though I admit I didn't cross-check. Thanks for the hint.
Tomalak
A: 

Thanks all, using the standard parser works fine. Thanks for all your help. -Udit.

You might want to post this kind of message as a comment on your question rather than as an answer.
Manrico Corazzi
+2  A: 

See http://webui.sourcelabs.com/jakarta-regexp/issues/22804 -- an old version of this package did limit the maximum number of allowed ranges, to 16. This was fixed in 2003, but it's conceivable that they just raised the limit a little bit. (?)
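
To get a rough feel for how many ranges a character class amounts to, one could count the disjoint runs of consecutive code points in it. This is only a sketch: it assumes neighbouring characters such as 'R' and 'S' collapse into a single range, which may or may not be how RECompiler counts internally, and `RangeCount` / `countRanges` are hypothetical names.

```java
import java.util.TreeSet;

public class RangeCount {
    // Counts the disjoint code point ranges needed to cover the given
    // characters; duplicates are ignored and neighbours are merged.
    static int countRanges(String chars) {
        TreeSet<Integer> sorted = new TreeSet<>();
        chars.codePoints().forEach(cp -> sorted.add(cp));

        int ranges = 0;
        Integer previous = null;
        for (int cp : sorted) {
            // A new range starts whenever there is a gap after the previous code point.
            if (previous == null || cp != previous + 1) {
                ranges++;
            }
            previous = cp;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // a-c, x and z form three ranges.
        System.out.println(countRanges("abcxz"));
    }
}
```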

Anyway, I always use the Java built-in Regex parser for any new work, as others have suggested.

Eddie
+1 for pointing out the appropriate bug.
Tomalak