tags:

views:

297

answers:

6

Hi Guys,

I have been using the java.util.regex.* classes for Regular Expression in Java and all good so far. But today I have a different requirement. For example consider the pattern to be "aabb". Now if the input String is aa it will definitely not match, however there is still possibility that if I append bb it becomes aabb and it matches. However if I would have started with cc, no matter what I append it will never match.

I have explored the Pattern and Matcher class but didn't find any way of achieving this.

The input will come from user and system have to wait till pattern matches or it will never match irrespective of any input further.

Any clue?

Thanks.

+1  A: 

So you want to know not whether a String s matches the regex, but whether there might be a longer String starting with s that would match? Sorry, Regexes can't help you there because you get no access to the internal state of the matcher; you only get the boolean result and any groups you have defined, so you never know why a match failed.

If you're willing to hack the JDK libraries, you can extend (or probably fork) java.util.regex and give out more information about the matching process. If the match failed because the input was 'used up' the answer would be true; if it failed because of character discrimination or other checks it would be false. That seems like a lot of work though, because your problem is completely the opposite of what regexes are supposed to do.

Another option: maybe you can simply redefine the task so that you can treat the input as the regexp and match aabb against aa.*? You have to be careful about regex metacharacters, though.

Kilian Foth
Re your second paragraph: I'd say "If the match failed because the input was 'used up' *at any time during the match attempt* the answer would be true". After all, the regex engine might have matched until the end once, then backtracked and failed without ever getting back to the end of the string. Like when applying `^A.*BC$` to `ABCD`.
Tim Pietzcker
So it seems that the `hitEnd()` method Alan Moore wrote about does exactly that. Great.
Tim Pietzcker
A: 

For the example you give you could try to use an anti-pattern to disqualify invalid results. For example "^[^a]" would tell you you're input "c..." can't match your example pattern of "aabb".

Depending on your pattern you may be able to break it up into smaller patterns to check and use multiple matchers and then set their bounds as one match occurs and you move to the next. This approach may work but if you're pattern is complex and can have variable length sub-parts you may end up reimplementing part of the matcher in your own code to adjust the possible bounds of the match to make it more or less greedy. A pseudo-code general idea of this would be:

boolean match(String input, Matcher[] subpatterns, int matchStart, int matchEnd){
  matcher = next matcher in list;
  int stop = matchend;
  while(true){
    if matcher.matches input from matchstart -> matchend{
      if match(input, subpatterns, end of current match, end of string){
        return true;
      }else{
        //make this match less greedy
        stop--;
      }
    }else{
      //no match
      return false;
    }
  }
}

You could then merge this idea with the anti-patterns, and have anti-subpatterns and after each subpattern match you check the next anti-pattern, if it matches you know you have failed, otherwise continue the matching pattern. You would likely want to return something like an enum instead of a boolean (i.e. ALL_MATCHED, PARTIAL_MATCH, ANTI_PATTERN_MATCH, ...)

Again depending on the complexity of your actual pattern that you are trying to match writing the appropriate sub patterns / anti-pattern may be difficult if not impossible.

M. Jessup
A: 

You might be able to accomplish this with a state machine (http://en.wikipedia.org/wiki/State_machine). Have your states/transitions represent valid input and one error state. You can then feed the state machine one character (possibly substring depending on your data) at a time. At any point you can check if your state machine is in the error state. If it is not in the error state then you know that future input may still match. If it is in the error state then you know something previously failed and any future input will not make the string valid.

brainimus
A: 

One way to do this is to parse your regex into a sequence of sub-regexes, and then reassemble them in a way that allows you to do partial matches; e.g. "ab*c" has 3 sub-regexes "a", "b*" and "c" which you can then reassemble as "a(b*(c)?)?".

Things get more complicated when the input regex contains alternation and groups, but the same general approach should work.

The problem with this approach is that the resulting regex is more complicated, and could potentially lead to excessive backtracking for complex input regexes.

Stephen C
A: 

If you make each character of the regex optional and relax the multiplicity constraints, you kinda get what you want. Example if you have a matching pattern "aa(abc)+bbbb", you can have a 'possible match' pattern 'a?a?(a?b?c?)*b?b?b?b?'.

This mechanical way of producing possible-match pattern does not cover advanced constructs like forward and backward refs though.

ddimitrov
+5  A: 

You should have looked more closely at the Matcher API; the hitEnd() method works exactly as you described:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String[] ss = { "aabb", "aa", "cc", "aac" };
    Pattern p = Pattern.compile("aabb");
    Matcher m = p.matcher("");

    for (String s : ss) {
      m.reset(s);
      if (m.matches()) {
        System.out.printf("%-4s : match%n", s);
      }
      else if (m.hitEnd()) {
        System.out.printf("%-4s : partial match%n", s);
      }
      else {
        System.out.printf("%-4s : no match%n", s);
      }
    }
  }
}

output:

aabb : match
aa   : partial match
cc   : no match
aac  : no match

As far as I know, Java is the only language that exposes this functionality. There's also the requireEnd() method, which tells you if more input could turn a match into a non-match, but I don't think it's relevant in your case.

Both methods were added to support the Scanner class, so it can apply regexes to a stream without requiring the whole stream to be read into memory.

Alan Moore
"As far as I know, Java is the only language that exposes this functionality." -- is this not equivalent to Boost's partial matching? (http://www.boost.org/doc/libs/1_34_1/libs/regex/doc/partial_matches.html)
polygenelubricants
That is cool. Can you try what `hitEnd()` returns when you match `ABCD` against `A.*BC$`?
Tim Pietzcker
@Tim, I get "partial match", which makes sense, since you can add "BC" to the end and get a match (which I did, and it did).
Alan Moore
@poly: Yes, that does seem to be the same as Java's `hitEnd()`; thanks for the pointer. I don't see any equivalent for `requireEnd()` though.
Alan Moore