views:

224

answers:

2

I would like to be able to compute the set of all characters which may be matched as the first character in a string by a given instance of java.util.regex.Pattern. More formally, given the DFA equivalent to a certain regular expression, I want the set of all outgoing transitions from the start state.

An example:

Pattern p = Pattern.compile("[abc]def|daniel|chris|\\s+");
Set<Character> first = getFirstSet(p);

The set first should contain the following elements:

{ 'a', 'b', 'c', 'd', ' ', '\n', '\r', '\t' }

Any ideas? I'm well aware that I could construct the DFA myself and determine the relevant states that way, but I'd like to avoid that kind of hassle (read: it's not worth that much to me). Note that my host language is actually Scala, so I have access to all of the core Scala libs (for what it's worth).

+3  A: 

I think you could parse the regular expression and define some recursive function which operates on the parsed regular expression in a left-to-right-manner, building up such a set of firsts.

Some things are simple:

  • Sequence: first(r1r2) = first(r1) + ( if '' in first(r1) first(r2) else empty set )
  • Alternation: first(r1|r2) = first(r1) + first(r2)
  • Iteration: first(r*) = first(r) + ''
  • Characters: first(c) = c
  • Characterclasses: first([c1-cn]) = set(c1, c2, ..., cn) ...

Extend this to all primitives and special flags your regular expression dialect knows and you are good to go.

Tetha
Yeah, I thought about that. This would effectively be the same as building the front end of the DFA myself. Maybe I'll do this if it comes down to it, but I would rather find a simpler solution.
Daniel Spiewak
I am not sure how much simpler it becomes than parsing it (according to a fixed grammar from the language standard) and some pretty obvious recursions, but maybe that is just my compiler-construction infused brain
Tetha
Well, parsing and then recursive traversal aren't too bad, I'm just not happy about having to replicate Java's regular expression semantics just to get a FIRST set.
Daniel Spiewak
+1  A: 

You could solve it recursivly ...

  • Strip of enclosing parenthesis and call recursivly.
  • Split at toplevel alternatives and call recursivly for each part.
  • If there are no alternatives,
    • output all symbols starting from the left up to the first none optional symbol.
    • If there are charachter groups, output all symbols.

There are probably a lot of errors in this idea, but this is what I would try. You have to strip out assertion, group names and thousand other things. And if you find an inverted character class like [^0-9] you have to output a lot of characters.

So I assume it is really a complex problem.

Daniel Brückner
I hadn't thought about negation classes. Considering that Java uses UTF-8, the FIRST set could have a size in the hundreds of thousands. Ouch! Maybe I'll just assume that a regular expression can match anything and leave it unoptimized...
Daniel Spiewak
I just thought about rewriting the expression - if you could build an expression that matches only the first symbol that the input expression, you could loop over some single symbol string and test if the derived expression matches. But to get all characters you would have to loop over the whole character set - a lot of work for Unicode. And i cannot think of a method to get such an derived expression. Maybe hacking into the regex implementation and examine the internal state via reflection.
Daniel Brückner
But all this would probably come clos to writing a regex implementation yourself. So I am out of (good) ideas.
Daniel Brückner