I am writing Java code that has to distinguish regular expressions with more than one possible match from regular expressions that have only one possible match.

For example:

"abc." can have several matches ("abc1", "abcf", ...), while "abcd" can only match "abcd".

Right now my best idea was to look for all unescaped regexp special characters.

I am convinced that there is a better way to do it in Java. Ideas?

(Late addition):

To make things clearer - there is NO specific input to test against. A good solution for this problem will have to test the regex itself.

In other words, I need a method whose signature may look something like this:

boolean isSingleResult(String regex)

This method should return true if and only if there is exactly one possible String s1 for which s1.matches(regex) returns true. (See examples above.)

A: 

If it can only have one possible match it isn't reeeeeally an expression, now, is it? I suspect your best option is to use a different tool altogether, because this does not at all sound like a job for regular expressions, but if you insist, well, no, I'd say your best option is to look for unescaped special characters.
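A minimal sketch of the "look for unescaped special characters" check, for illustration only. The class name, the method name, and the SPECIAL list are assumptions made for this sketch, not part of any library. It is deliberately conservative: it also rejects things like \t or \Q...\E, which in fact match only one string.

```java
public class LiteralRegexCheck {
    // Characters treated as "special" here. This list is an assumption for the
    // sketch, not a complete account of everything java.util.regex interprets.
    private static final String SPECIAL = ".[]{}()*+?|^$";

    /** Returns true if the regex contains no unescaped special character. */
    public static boolean hasNoUnescapedSpecials(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                // A trailing backslash is not a valid regex.
                if (i + 1 >= regex.length()) {
                    return false;
                }
                char next = regex.charAt(i + 1);
                // Backslash + letter or digit is a class, boundary, back-reference
                // and so on (\d, \w, \b, \1), so only escaped punctuation is
                // accepted as a literal.
                if (Character.isLetterOrDigit(next)) {
                    return false;
                }
                i++; // skip the escaped character
            } else if (SPECIAL.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasNoUnescapedSpecials("abcd"));   // true
        System.out.println(hasNoUnescapedSpecials("abc."));   // false
        System.out.println(hasNoUnescapedSpecials("abc\\.")); // true
        System.out.println(hasNoUnescapedSpecials("abc\\d")); // false
    }
}
```

A false result here only means "could not prove it is a single-match pattern"; a full answer would need a real regex parser.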

David Hedlund
The problem is that even non-generic expressions are *really* expressions - otherwise my job would be as easy as try-catch-ing Pattern.compile()...
Amir Arad
Using a try/catch block to validate input where an exception is expected is also not a good design. But never mind that; all I'm saying is that if you only want to deal with patterns entirely without special characters, then you are most likely trying to do something that is not best achieved with regular expressions after all. It is hard to say what your best solution would be without knowing in further detail what you're trying to do, but something involving String.contains, perhaps?
David Hedlund
I agree that I am in a somewhat sad situation. My task is very specific and well-defined: to extend an existing filtering mechanism to use regex. The existing mechanism distinguishes between general (String.startsWith) and exact patterns by means of a prefix. My job is to use regex to extend the searching capabilities and lose the all-annoying prefix.
Amir Arad
A: 

As I see it, the only way is to check whether the regexp matches multiple times for a particular input.

package com;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AAA {
    public static void main(String[] args) throws Exception {
        String input = "123 321 443 52134 432";
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(input);
        int i = 0;
        while (matcher.find()) {
            ++i;
        }
        System.out.printf("Matched %d times%n", i);
    }
}
denis.zhdanov
I think you didn't understand the OP. There is *no* input. There is only the regex.
BalusC
I didn't understand you right then.
denis.zhdanov
Besides ignoring the question's problematic aspect (the need to match an infinite number of inputs), it is also not a real answer to the question...
Amir Arad
+1  A: 

This sounds dirty, but it might be worth having a look at the Pattern class in the Java source code.

Taking a quick peek, it seems like it 'normalize()'s the given regex (Line 1441), which could turn the expression into something a little more predictable. I think reflection can be used to tap into some private resources of the class (use caution!). It is possible that, while tokenizing the regex pattern, there are specific indications of whether it has reached some kind of "multi-matching" element in the pattern.

Update

After having a closer look, there is some data within package scope that you can use to leverage the work of the Pattern tokenizer to walk through the nodes of the regex and check for multiple-character nodes.

After compiling the regular expression, iterate through the compiled "Node"s starting at Pattern.root. Starting at line 3034 of the class, there are the generalized types of nodes. For example class Pattern.All is multi-matching, while Pattern.SingleI or Pattern.SliceI are single-matching, and so on.

All these token classes appear to be in package scope, so it should be possible to do this without using reflection, but instead creating a java.util.regex.PatternHelper class to do the work.

Hope this helps.

BranTheMan
It's been a long time since I've done this sort of hacking so my info may be out of date, but I believe the default Java classloader stack restricts who can load java.* classes. I'm just saying it may also require a custom classloader. I still think it's probably the best solution short of creating your own regex parser.
PSpeed
Thanks. My boss decided we should simply mark the "non-regexp" string with a prefix, so it seems I won't dive into the Pattern class yet, but of all the answers I got, your solution would probably be the way I'd go at it if I had to do this.
Amir Arad
A: 

The only regular expression that can ONLY match one input string is one that specifies the string exactly. So you need to match expressions with no wildcard characters or character groups AND that specify a start "^" and end "$" anchor.

  • "the quick" matches:

    • "the quick brown fox"
    • "the quick brown dog"
    • "catch the quick brown fox"
  • "^the quick brown fox$" matches ONLY:

    • "the quick brown fox"
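The difference can be checked with a short sketch. The class and helper names below are made up for illustration. It also shows that String.matches, which the question uses, must consume the whole input, so it behaves as if the pattern were anchored at both ends.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Hypothetical helper: true if the regex matches anywhere inside the input.
    public static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // find() looks for the pattern anywhere in the input.
        System.out.println(foundAnywhere("the quick", "catch the quick brown fox"));            // true
        System.out.println(foundAnywhere("^the quick brown fox$", "catch the quick brown fox")); // false
        System.out.println(foundAnywhere("^the quick brown fox$", "the quick brown fox"));       // true

        // matches() has to consume the whole input, anchors or not.
        System.out.println(matchesWhole("the quick", "catch the quick brown fox")); // false
        System.out.println(matchesWhole("the quick", "the quick"));                 // true
    }
}
```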
Chris Nava
A: 

Hi, Now I understand what you mean. I live in Belgium...

Here is something that works on most expressions. I wrote it myself, so I may have forgotten some rules.

public static final boolean isSingleResult(String regexp) {
    // Escaped shorthand classes (\d, \w, \s and their negations) match
    // many characters, so check for them before stripping escapes.
    String[] exconexc = "\\d \\D \\w \\W \\s \\S".split(" ");
    for (String s : exconexc) {
        if (regexp.contains(s)) { // Forbidden class found
            return false;
        }
    }
    // Then remove all remaining escaped characters, e.g. "\." or "\*":
    String regex = regexp.replaceAll("\\\\.", "");
    // Now, all the strings that can mean more than one match.
    // "[" covers every bracketed character class, e.g. [abc] or [a-z].
    String[] mtom = "+ . ? | * { [".split(" ");
    // Iterate over all mtom strings
    for (String s : mtom) {
        if (regex.contains(s)) { // Forbidden char found
            return false;
        }
    }
    return true;
}

Martijn

Martijn Courteaux