I am writing a tool to help students learn regular expressions. I will probably be writing it in Java.

The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough.

But I want to support several different regex "flavors" such as:

  • Basic regular expressions (think: grep)
  • Extended regular expressions (think: egrep)
  • A subset of Perl regular expressions, including the character classes \w, \s, etc.
  • Sed-style regular expressions

Java has the java.util.regex package, but it supports only Perl-style regular expressions, which are roughly a superset of the basic and extended REs. What I think I need is a way to take any given regular expression and escape the metacharacters that aren't part of a given flavor. Then I could hand it to a java.util.regex.Pattern and it would behave as if it had been written for the selected RE interpreter.

For example, given the following regex:

^\w+[0-9]{5}-(\d{4})?$

As a basic regular expression, it would be interpreted as:

^\\w\+[0-9]\{5\}-\(\\d\{4\}\)\?$

As an extended regular expression, it would be:

^\\w+[0-9]{5}-(\\d{4})?$

And as a Perl-style regex, it would be the same as the original expression.
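
To make the intent concrete, here is a minimal sketch that feeds the three renderings above to java.util.regex.Pattern. The class name FlavorCheck and the sample input strings are made up for illustration; the only point is that, once escaped, the "basic" and "extended" versions treat the affected metacharacters as literal text.

```java
import java.util.regex.Pattern;

final class FlavorCheck {
    // The three renderings from the example above, written as Java string literals
    // (every regex backslash is doubled once more for the Java compiler).
    static final String PERL   = "^\\w+[0-9]{5}-(\\d{4})?$";
    static final String AS_ERE = "^\\\\w+[0-9]{5}-(\\\\d{4})?$";
    static final String AS_BRE = "^\\\\w\\+[0-9]\\{5\\}-\\(\\\\d\\{4\\}\\)\\?$";

    // True if the regex matches somewhere in the text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches(PERL, "abc12345-6789"));         // true
        System.out.println(matches(AS_ERE, "abc12345-6789"));       // false: \w is now a literal backslash + w
        System.out.println(matches(AS_ERE, "\\www12345-"));         // true
        System.out.println(matches(AS_BRE, "\\w+7{5}-(\\d{4})?"));  // true: almost everything is literal
    }
}
```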

Is there a "regular expression for regular expressions" that I could run through a regex search-and-replace to quote the characters that are not metacharacters in the chosen flavor? What else could I do? Are there alternative Java classes I could use?

+1  A: 

Alternatively, you could use Jakarta ORO?

This supports the following regex 'flavors':

  • Perl5 compatible regular expressions
  • AWK-like regular expressions
  • glob expressions
toolkit
+1  A: 

Check out this post for a 'regular expression for regular expressions': http://stackoverflow.com/questions/172303/is-there-a-regular-expression-to-detect-a-valid-regular-expression

You can use this as a basis for your module.

Manu
A: 

I have written something similar: http://stackoverflow.com/questions/172303/is-there-a-regular-expression-to-detect-a-valid-regular-expression#172316

You could take part of that expression and match each token separately:

[^?+*{}()[\]\\]                # literal characters
\\[A-Za-z]                     # Character classes
\\\d+                          # Back references
\\\W                           # Escaped characters
\[\^?(?:\\.|[^\\])+?\]         # Character class
\((?:\?[:=!>]|\?<[=!])?        # Beginning of a group
\)                             # End of a group
(?:[?+*]|\{\d+(?:,\d*)?\})\??  # Repetition
\|                             # Alternation

For each match, you could have some dictionary of appropriate replacements in the target flavor.
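
A rough Java sketch of that idea, assuming the simplified mapping from the question rather than the full POSIX rules (for instance, BRE `\(...\)` groups and `\{m,n\}` intervals are not handled). The class FlavorTranslator, the Flavor enum and the group names are hypothetical, and the token pattern uses only a few of the alternatives listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlavorTranslator {
    enum Flavor { BRE, ERE, PERL }

    // One alternative per token kind; the last one matches any single character,
    // so successive find() calls walk the source without gaps.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<bracket>\\[\\^?(?:\\\\.|[^\\\\])+?\\])"  // [...] copied through unchanged
        + "|(?<shorthand>\\\\[A-Za-z])"                 // \w, \d, \s, ...
        + "|(?<escape>\\\\.)"                           // any other escaped character
        + "|(?<meta>[?+{}()|])"                         // operators treated as literal in BRE
        + "|(?<other>.)",                               // everything else
        Pattern.DOTALL);

    // Rewrites a pattern written for the given flavor into java.util.regex syntax.
    static String toJavaRegex(String source, Flavor flavor) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String tok = m.group();
            if (flavor != Flavor.PERL && m.group("shorthand") != null) {
                out.append("\\\\").append(tok.charAt(1));  // \w becomes \\w (literal backslash, then w)
            } else if (flavor == Flavor.BRE && m.group("meta") != null) {
                out.append('\\').append(tok);              // + becomes \+
            } else {
                out.append(tok);                           // same meaning in Java
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "^\\w+[0-9]{5}-(\\d{4})?$";
        System.out.println(toJavaRegex(source, Flavor.BRE));  // ^\\w\+[0-9]\{5\}-\(\\d\{4\}\)\?$
        System.out.println(toJavaRegex(source, Flavor.ERE));  // ^\\w+[0-9]{5}-(\\d{4})?$
    }
}
```

The if/else chain plays the role of the dictionary here; a real tool would probably keep one replacement table per flavor so that new flavors can be added without touching the loop.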

MizardX
A: 

If your target is a Unix / Linux system, why not just shell out to the definitive host of each regex? I.e., use grep for BRE, egrep for ERE, perl for PCRE, etc. The only thing your module would need to do is the UI. Most of the decent regex testers that I have seen use a variant of this approach.
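
A minimal sketch of shelling out from Java, assuming GNU grep is on the PATH (`-G` selects BRE, `-E` selects ERE, `-o` prints only the matched parts). The class and method names are made up, and the text is written to the process before any output is read, which is only safe for small inputs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GrepRunner {
    // Builds the grep command line for the chosen flavor.
    static List<String> command(boolean extended, String regex) {
        return Arrays.asList("grep", "-o", extended ? "-E" : "-G", "-e", regex);
    }

    // Runs grep over the text and returns each matched fragment, one per output line.
    static List<String> matches(boolean extended, String regex, String text)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(extended, regex))
                .redirectError(ProcessBuilder.Redirect.INHERIT)  // let grep's error messages through
                .start();
        try (OutputStream stdin = p.getOutputStream()) {
            stdin.write(text.getBytes(StandardCharsets.UTF_8));
        }
        List<String> found = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                found.add(line);
            }
        }
        p.waitFor();
        return found;
    }
}
```

The UI would then highlight the returned fragments in the original text; mapping them back to exact offsets would need extra work (GNU grep's `-b` option reports byte offsets).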

If you want yet another library suggestion, look at TRE for the BRE / ERE / POSIX / AWK part. It does not support back references, so PCRE / Python / Ruby / JS / Java is out...

drewk