ansaurus

Question

Can I Determine the Set of First Chars Matched by Regex Pattern?

Answer 1

+3 A:

I think you could parse the regular expression and define some recursive function which operates on the parsed regular expression in a left-to-right-manner, building up such a set of firsts.

Some things are simple:

Sequence: first(r1r2) = first(r1) + ( if '' in first(r1) first(r2) else empty set )
Alternation: first(r1|r2) = first(r1) + first(r2)
Iteration: first(r*) = first(r) + ''
Characters: first(c) = c
Characterclasses: first([c1-cn]) = set(c1, c2, ..., cn) ...

Extend this to all primitives and special flags your regular expression dialect knows and you are good to go.

Tetha 2009-04-24 19:09:02

Yeah, I thought about that. This would effectively be the same as building the front end of the DFA myself. Maybe I'll do this if it comes down to it, but I would rather find a simpler solution.

Daniel Spiewak 2009-04-24 19:11:00

I am not sure how much simpler it becomes than parsing it (according to a fixed grammar from the language standard) and some pretty obvious recursions, but maybe that is just my compiler-construction infused brain

Tetha 2009-04-24 19:16:13

Well, parsing and then recursive traversal aren't too bad, I'm just not happy about having to replicate Java's regular expression semantics just to get a FIRST set.

Daniel Spiewak 2009-04-24 19:24:56

Answer 2

+1 A:

You could solve it recursivly ...

Strip of enclosing parenthesis and call recursivly.
Split at toplevel alternatives and call recursivly for each part.
If there are no alternatives,
- output all symbols starting from the left up to the first none optional symbol.
- If there are charachter groups, output all symbols.

There are probably a lot of errors in this idea, but this is what I would try. You have to strip out assertion, group names and thousand other things. And if you find an inverted character class like [^0-9] you have to output a lot of characters.

So I assume it is really a complex problem.

Daniel Brückner 2009-04-24 19:14:22

I hadn't thought about negation classes. Considering that Java uses UTF-8, the FIRST set could have a size in the hundreds of thousands. Ouch! Maybe I'll just assume that a regular expression can match anything and leave it unoptimized...

Daniel Spiewak 2009-04-24 19:26:07

I just thought about rewriting the expression - if you could build an expression that matches only the first symbol that the input expression, you could loop over some single symbol string and test if the derived expression matches. But to get all characters you would have to loop over the whole character set - a lot of work for Unicode. And i cannot think of a method to get such an derived expression. Maybe hacking into the regex implementation and examine the internal state via reflection.

Daniel Brückner 2009-04-24 19:42:30

But all this would probably come clos to writing a regex implementation yourself. So I am out of (good) ideas.

Daniel Brückner 2009-04-24 19:43:47

ansaurus

tags:

views:

answers:

Can I Determine the Set of First Chars Matched by Regex Pattern?

related questions