answers:

4

Hi,

I'd like to port a generic text processing tool, Texy!, from PHP to Java.

This tool does ungreedy matching everywhere, using preg_match_all("/.../U"). So I am looking for a library that has some kind of UNGREEDY flag.

I know I could use the .*? syntax, but there are a great many regular expressions I would have to rewrite, and I would have to re-check them with every updated version.

I've checked

  • ORO - seems to be abandoned
  • Jakarta Regexp - no support
  • java.util.regex - no support

Is there any such library?

Thanks, Ondra

+3  A: 

Update: After checking the docs I found the LAZY flag, which is another term for non-greedy. However, it only appears to be available in OpenJDK.

p = Pattern.compile("your regex here", LAZY);
p.matcher("string to match")

Original (deprecated) response: I honestly don't think there's one.

The whole point of the +? and *? is so that you can choose which sections to do greedily and which ones to do lazily.
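
To make the difference concrete, here is a minimal sketch using plain java.util.regex. The class name, the helper `firstMatch`, and the sample input are made up for illustration; only the `*` versus `*?` behaviour is the point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b> and <b>two</b>";
        // Greedy: .* grabs as much as possible, so the match runs to the last </b>.
        System.out.println(firstMatch("<b>.*</b>", input));
        // Lazy: .*? grabs as little as possible, so the match stops at the first </b>.
        System.out.println(firstMatch("<b>.*?</b>", input));
    }
}
```

The first call prints the whole input and the second prints only `<b>one</b>`, which is the per-quantifier choice described above.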

Greedy is the default behaviour because that's the most common use of + and * in regular expressions. In fact, I can't think of a single regex parser that does it the other way around, that is, one where matching is lazy by default and a modifier is used to make something greedy.

I know this isn't the answer you're looking for, but the only way I think you'll be able to make it work is to add the ? to your *'s and +'s. On the upside, you can use regular expressions to help determine which ones need to be changed, or even to make the changes for you if all of them need to be changed or if you can describe a pattern that identifies which ones do.
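
As a rough sketch of automating that rewrite, the hypothetical helper below walks a pattern string and swaps greedy and lazy quantifiers, which is approximately what PCRE's /U modifier does. The class and method names are invented for this example. It is deliberately simplified: it skips escaped characters, character classes, and the `?` that follows an opening parenthesis, but it does not understand `\Q...\E` quoting, a literal `]` placed first in a character class, or comments in extended mode, so treat its output as something to check against tests rather than trust blindly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UngreedyConverter {
    // Matches a counted quantifier such as {3}, {2,} or {2,5}.
    private static final Pattern COUNTED = Pattern.compile("\\{\\d+(,\\d*)?\\}");

    // Swaps greedy and lazy quantifiers; possessive ones are left unchanged.
    static String invertGreediness(String regex) {
        StringBuilder out = new StringBuilder();
        int classDepth = 0; // > 0 while inside [...]
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Escaped character: copy both characters untouched.
                out.append(c).append(regex.charAt(i + 1));
                i += 2;
                continue;
            }
            if (classDepth > 0) {
                // Inside a character class, * + ? { are literals.
                if (c == '[') classDepth++;
                else if (c == ']') classDepth--;
                out.append(c);
                i++;
                continue;
            }
            if (c == '[') { classDepth = 1; out.append(c); i++; continue; }
            if (c == '(') {
                out.append(c);
                i++;
                // In "(?:", "(?=", "(?i)" and so on, the '?' is group syntax, not a quantifier.
                if (i < regex.length() && regex.charAt(i) == '?') { out.append('?'); i++; }
                continue;
            }
            int quantEnd = -1;
            if (c == '*' || c == '+' || c == '?') {
                quantEnd = i + 1;
            } else if (c == '{') {
                Matcher m = COUNTED.matcher(regex).region(i, regex.length());
                if (m.lookingAt()) quantEnd = m.end();
            }
            if (quantEnd < 0) { out.append(c); i++; continue; }
            out.append(regex, i, quantEnd);
            i = quantEnd;
            if (i < regex.length() && regex.charAt(i) == '?') {
                i++;                         // lazy becomes greedy: drop the '?'
            } else if (i < regex.length() && regex.charAt(i) == '+') {
                out.append('+'); i++;        // possessive: keep as is
            } else {
                out.append('?');             // greedy becomes lazy: add a '?'
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(invertGreediness("<b>.*</b>")); // prints <b>.*?</b>
    }
}
```

The converted string can then be passed to `Pattern.compile` as usual, so the original PHP patterns could stay unmodified in the source and be converted when they are loaded.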

EmFi
so are you saying there is no way to change the default behavior? Having a default behavior that can not be changed just because it's "the most common[...]" doesn't mean that having the switch is a bad idea
hhafez
I wasn't saying it's impossible, or even unnecessary. I was just speaking from my experience in a number of languages: I had never even seen a laziness switch for a regular expression until the asker mentioned preg_match_all("/.../U").
EmFi
Wow, if it's in OpenJDK, then there's a good chance of this making it into Sun JDK! And, hopefully, I can take OpenJDK's implementation and use it in Sun JDK. But where did you find it? It's not in the doc: http://www.jdocs.com/javase/7.b12/java/util/regex/Pattern.html (which should be OpenJDK's doc).
Ondra Žižka
Here's where I found it. Note that it's listed as a final int, which usually means a flag. But there isn't exactly a description, so it may be unimplemented. http://www.docjar.com/docs/api/java/util/regex/Pattern.html
EmFi
Unfortunately, this is just a constant used internally for parsing and processing closures. `docjar.com` generates its docs from the source, so it also shows private-scope members.
Ondra Žižka
+1  A: 

About the idea of checking and rechecking all regular expressions: are you sure that the PHP and Java libraries agree enough on syntax that you wouldn't have to do this anyway? What I'd do up front is go through them all, write some tests (input and output), and make sure that they work the same in both implementations. Then devise a way to run them automatically, and you will be covered for future upgrades and incompatibilities. You'll still need to tweak stuff, but at least you'll know where.
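
One possible shape for such a check, as a sketch only: a small table of cases run against java.util.regex. The class name, the helpers `allMatches` and `check`, and the sample case are hypothetical; in a real setup the expected values would be recorded from the PHP original rather than written by hand as they are here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRegression {
    // Collects every full match, similar in spirit to $matches[0] from preg_match_all.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns true when the Java result equals the expected (PHP-recorded) result.
    static boolean check(String regex, String input, List<String> expected) {
        return allMatches(regex, input).equals(expected);
    }

    public static void main(String[] args) {
        // Hand-written sample case; real cases would be generated by the PHP tool.
        boolean ok = check("\\*\\*.+?\\*\\*", "**a** and **b**",
                           Arrays.asList("**a**", "**b**"));
        System.out.println(ok ? "PASS" : "FAIL");
    }
}
```

Keeping the cases as data (pattern, input, expected matches) means the same table could be replayed after every upgrade of either implementation.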

Jeremy Huiskamp
Well, java.util.regex should be Perl 5 compatible, apart from a few features which are not used in the tool, besides this one. And sure, I've asked the author of the PHP original to create some tests which would kind of certify other implementations.
Ondra Žižka
+1  A: 

I suggest you create your own modified Java library. Simply copy the java.util.regex source into your own package.

The Sun JDK 1.6 Pattern.java class defines these constants:

    static final int GREEDY     = 0;
    static final int LAZY       = 1;
    static final int POSSESSIVE = 2;

You'll notice that these constants are only used a couple of times, so the code would be trivial to modify. Take the following example:

    case '*':
        ch = next();
        if (ch == '?') {
            next();
            return new Curly(prev, 0, MAX_REPS, LAZY);
        } else if (ch == '+') {
            next();
            return new Curly(prev, 0, MAX_REPS, POSSESSIVE);
        }
        return new Curly(prev, 0, MAX_REPS, GREEDY);

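Simply change the last line to use the LAZY flag instead of the GREEDY flag. Since you want a regex library that behaves like the PHP one, this might be the best way to go. Note that with this change alone *? would still be lazy, whereas PHP's U modifier inverts it to greedy, so that branch would need swapping too for a faithful port.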

brianegge
Already looking at it just now :)
Ondra Žižka
Actually, the patch for this RFE would be as simple as replacing the GREEDY in the default return path with a variable derived from the flags. Great, I'm going to submit a patch to the JDK :)
Ondra Žižka
+1  A: 

You may be able to use 'com.caucho.quercus.lib.regexp.JavaRegexpModule'. Quercus is a Java implementation of PHP, and the regex library implements the PHP regex syntax and method names.

brianegge