I have a container of regular expressions. I'd like to analyze them to determine whether it's possible to generate a string that matches more than one of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?

+16  A: 

There's no easy way.

As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.

Given any pair of NFAs, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).

The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFAs. Alternatively, if you've already shown how to compute the complement of an NFA, then you can (De Morgan's law style) get the intersection as complement(union(complement(A),complement(B))).

Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:

Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard example here is the family of languages L_k consisting of all strings over the alphabet {a,b} whose k-th-from-last letter equals a.

For instance, L_k is described by the regexp (a|b)*a(a|b){k-1}, which yields an NFA with about k+1 states, but any DFA must remember the last k letters it has seen and so needs at least 2^k states.

By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
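For a taste of the text format: an OpenFST acceptor is one arc per line ("source destination symbol"), with a line containing just a state number marking that state final. Something like this (my own toy example, not from the docs) describes an acceptor for the single string "ab"; fstcompile --acceptor turns it into a binary FST you can feed to fstintersect (you'll also need an --isymbols symbol table mapping a and b to integers):

    0 1 a
    1 2 b
    2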

Fortunately, it's possible to avoid the subset-of-states explosion and intersect two NFAs directly, using the same construction as for DFAs:

if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')

and X ->a Y (in the other NFA)

then (A,X) ->a (B,Y) in the intersection

(C,Z) is final iff C is final in the one NFA and Z is final in the other.

To start the process off, begin in the pair of the two NFAs' start states, e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. Store the fact that you've expanded a state's arcs (e.g. in a hash table) so each state is expanded only once; you end up exploring exactly the states reachable from the start.
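Here's a minimal Python sketch of that exploration. The representation is just for illustration, not from any library: an NFA is a (start, finals, delta) triple, where delta maps a state to a dict from letter to the set of successor states.

    from collections import deque

    def intersect_nfas(nfa1, nfa2):
        # Each NFA is (start, finals, delta); delta: state -> {letter: set(states)}.
        start1, finals1, delta1 = nfa1
        start2, finals2, delta2 = nfa2
        start = (start1, start2)
        finals, delta = set(), {}
        seen, queue = {start}, deque([start])
        while queue:
            a, x = queue.popleft()
            if a in finals1 and x in finals2:
                finals.add((a, x))
            arcs = {}
            # pair every arc A ->letter B with every arc X ->letter Y
            for letter, bs in delta1.get(a, {}).items():
                ys = delta2.get(x, {}).get(letter, set())
                for b in bs:
                    for y in ys:
                        arcs.setdefault(letter, set()).add((b, y))
            delta[(a, x)] = arcs
            for targets in arcs.values():
                for pair in targets:
                    if pair not in seen:      # expand each state's arcs only once
                        seen.add(pair)
                        queue.append(pair)
        return start, finals, delta

Because the construction only ever creates reachable pair-states, the product never exceeds |states of one NFA| x |states of the other| and is often much smaller.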

If you allow epsilon transitions (that don't output a letter), that's fine:

if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
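In the sketch above, that rule is two more loops inside the state expansion (using None as the epsilon label, an arbitrary choice of mine):

    EPS = None  # label reserved for epsilon arcs

    # inside the while loop of intersect_nfas, after pairing the lettered arcs:
    for b in delta1.get(a, {}).get(EPS, set()):
        arcs.setdefault(EPS, set()).add((b, x))   # (A,Y) ->epsilon (B,Y)
    for y in delta2.get(x, {}).get(EPS, set()):
        arcs.setdefault(EPS, set()).add((a, y))   # epsilon in the second-position NFA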

Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
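With the same toy representation, that union is a few lines (again just a sketch; the (i, q) tagging merely keeps the different NFAs' state names from colliding):

    def union_nfas(nfas):
        # fresh start state with an epsilon arc to each NFA's start state
        start = 'union-start'
        finals, delta = set(), {start: {EPS: set()}}
        for i, (s, f, d) in enumerate(nfas):
            delta[start][EPS].add((i, s))
            finals |= {(i, q) for q in f}
            for q, arcs in d.items():
                delta[(i, q)] = {letter: {(i, r) for r in dests}
                                 for letter, dests in arcs.items()}
        return start, finals, delta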

Deciding emptiness for an NFA is easy: if you ever reach a final state during a depth-first search from the start state, the language is not empty.
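In the same representation, that search is:

    def is_empty(nfa):
        # the language is empty iff no final state is reachable from the start
        start, finals, delta = nfa
        stack, seen = [start], {start}
        while stack:
            state = stack.pop()
            if state in finals:
                return False
            for targets in delta.get(state, {}).values():
                for nxt in targets:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        return True

So the whole test for the original question is is_empty(intersect_nfas(nfa1, nfa2)) for each pair of regexps. (In fact, since intersect_nfas only ever builds reachable states, checking whether its finals set is non-empty already answers the question.)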

This NFA intersection is similar to finite-state transducer composition (a transducer is an NFA whose arcs are labeled with pairs of symbols; concatenated pairwise, the pairs match an input string against an output string, or transform a given input into an output).

wrang-wrang
Excellent answer, though this is pretty much exactly what I was hoping to avoid needing to code myself ;) OpenFST looks pretty neat though.
Joseph Garvin
+1  A: 

In theory, the problem you describe is impossible.

In practice, if you have a manageable number of regular expressions that use a limited subset of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.

Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.

ironchefpython
"regexps" that recognize context-free or worse languages do make it impossible to prove that there's no string in the intersection. Obviously these are not really "regular expressions" but something inspired by them. But the full regular expression language can be captured w/ NFAs, and most of what people actually use from "regexp".
wrang-wrang
I agree that regexps can be decomposed into NFAs, I'm just thinking it's possible that the original poster has a more specific use case, such as regexps like "c[aeiou]t" and "d[aeiou]g", and the allowable strings are an English dictionary.
ironchefpython
+2  A: 

This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
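For instance, assuming the inverter exposes something that yields every string a pattern matches (invert below is a stand-in name; check the actual entry point):

    # invert() is hypothetical -- substitute the inverter's real entry point
    set1 = set(invert('c[aeiou]t'))
    set2 = set(invert('[cm]a[rt]'))
    overlap = set1 & set2   # non-empty => some string matches both patterns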

Paul McGuire