Is there a way to find out if two arbitrary regular expressions are equivalent? It looks like a complex problem to me, but there might be some DFA simplification mechanism or something?
These two Perlmonks threads discuss this question (specifically, read blokhead's responses):
To test equivalence you can compute the minimal DFAs for the expressions and compare them.
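As a rough sketch of that idea in Python: assume each regex has already been converted to a complete DFA (the regex-to-DFA step is not shown). The dict layout with "start", "accept" and "delta" keys and the names canonical_minimal and same_language are made up for this illustration. The code minimizes by Moore-style partition refinement, numbers the resulting states in breadth-first order so that state names no longer matter, and compares the results.

```python
from collections import deque

def canonical_minimal(dfa, alphabet):
    """Return a canonical description of the minimal DFA for `dfa`."""
    symbols = sorted(alphabet)
    start, accept, delta = dfa["start"], dfa["accept"], dfa["delta"]

    # Step 1: keep only states reachable from the start state.
    reachable = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for sym in symbols:
            nxt = delta[(state, sym)]
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    # Step 2: refine the accepting/non-accepting split until it is stable.
    block = {s: int(s in accept) for s in reachable}
    while True:
        ids, refined = {}, {}
        for s in reachable:
            signature = (block[s],
                         tuple(block[delta[(s, sym)]] for sym in symbols))
            refined[s] = ids.setdefault(signature, len(ids))
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break

    # Step 3: number the blocks in BFS order so the result does not
    # depend on how the original states were named.
    representative = {}
    for s in reachable:
        representative.setdefault(block[s], s)
    number = {block[start]: 0}
    queue = deque([block[start]])
    rows = []
    while queue:
        rep = representative[queue.popleft()]
        targets = []
        for sym in symbols:
            target = block[delta[(rep, sym)]]
            if target not in number:
                number[target] = len(number)
                queue.append(target)
            targets.append(number[target])
        rows.append((rep in accept, tuple(targets)))
    return tuple(rows)

def same_language(dfa_a, dfa_b, alphabet):
    return canonical_minimal(dfa_a, alphabet) == canonical_minimal(dfa_b, alphabet)

# Hand-built example DFAs over {a, b} (illustrative only).
ends_in_a = {"start": 0, "accept": {1},
             "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}}
ends_in_a_redundant = {"start": "x", "accept": {"y"},
                       "delta": {("x", "a"): "y", ("x", "b"): "z",
                                 ("y", "a"): "y", ("y", "b"): "z",
                                 ("z", "a"): "y", ("z", "b"): "x"}}
contains_a = {"start": 0, "accept": {1},
              "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}}

print(same_language(ends_in_a, ends_in_a_redundant, "ab"))
print(same_language(ends_in_a, contains_a, "ab"))
```

The first two automata accept the same strings (those ending in a) even though one has an extra state, so they minimize to the same canonical form; the third accepts a different language.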
Due to the complexity and variation, it's going to be difficult, if not nigh on impossible.
The only thing I can think of would be to take a look at a Regexp parser (link below), which will ease the pain of picking a pattern apart, but putting the pieces back together again with an eye to checking for similarity will be difficult.
http://search.cpan.org/search?mode=module&query=Regexp%3A%3AParser
Testability of equality is one of the classical properties of regular expressions. (N.B. This doesn't hold if you're really talking about Perl regular expressions or some other technically nonregular superlanguage.)
Turn your REs to generalised finite automata A and B, then construct a new automaton A-B such that the accepting states of A have null transitions to the start states of B, and that the accepting states of B are inverted. This gives you an automaton that accepts all those strings accepted by A, except for all those accepted by B.
Do the same for B-A, and reduce both to pure FAs. If an FA has no accepting states accessible from a start state then it accepts the empty language. If you can show that both A-B and B-A are empty, you've shown that A = B.
Edit
Heh, I can't believe no one noticed the gigantic error there -- an intentional one, of course :-p
The automaton A-B as described will accept those strings that consist of a prefix accepted by A followed by a suffix not accepted by B. Building the desired A-B is a slightly trickier process. I can't think of it off the top of my head, but I do know it's well-defined (and likely involves creating states to represent the products of accepting states in A and non-accepting states in B).
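For what it's worth, here is a minimal Python sketch of one standard way to do this, the product construction, assuming both automata have already been turned into complete DFAs. The dict layout and the function names are invented for this example. Each product state is a pair (state of A, state of B), and a pair is accepting for A-B when its A half accepts and its B half does not. A-B is empty exactly when no such pair is reachable from the pair of start states.

```python
from collections import deque

def difference_is_empty(dfa_a, dfa_b, alphabet):
    """True if no string is accepted by dfa_a and rejected by dfa_b."""
    start = (dfa_a["start"], dfa_b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        # An accepting product state: A accepts here, B does not.
        if p in dfa_a["accept"] and q not in dfa_b["accept"]:
            return False
        for sym in alphabet:
            nxt = (dfa_a["delta"][(p, sym)], dfa_b["delta"][(q, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

def equivalent(dfa_a, dfa_b, alphabet):
    # A = B exactly when both A-B and B-A are empty.
    return (difference_is_empty(dfa_a, dfa_b, alphabet)
            and difference_is_empty(dfa_b, dfa_a, alphabet))

# Hand-built example DFAs over the one-letter alphabet {a}.
even_a = {"start": 0, "accept": {0},
          "delta": {(0, "a"): 1, (1, "a"): 0}}
mod4 = {(i, "a"): (i + 1) % 4 for i in range(4)}
even_a_mod4 = {"start": 0, "accept": {0, 2}, "delta": mod4}
multiple_of_4 = {"start": 0, "accept": {0}, "delta": mod4}

print(equivalent(even_a, even_a_mod4, "a"))
print(equivalent(even_a, multiple_of_4, "a"))
```

In the second comparison only one direction fails: every string with a multiple of four a's also has an even number of them, so multiple_of_4 minus even_a is empty, but "aa" is in even_a and not in multiple_of_4.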
This really depends on what you mean by regular expressions. As the other posters pointed out, reducing both expressions to their minimal DFA should work, but it only works for the pure regular expressions.
Some of the constructs used in real-world regex libraries (backreferences in particular) give them the power to express languages that aren't regular, so the DFA algorithm won't work for them. For example, the regex ([a-z]*) \1, matched against a whole string, accepts a double occurrence of the same word separated by a space (a a and b b, but not b a or a b). This cannot be recognized by a finite automaton at all.
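The backreference behaviour is easy to try out. The snippet below uses Python's re module rather than Perl, purely as an illustration; fullmatch is used so that the pattern has to cover the whole string.

```python
import re

# Group 1 captures a word; \1 requires the same text again after the space.
pattern = re.compile(r"([a-z]*) \1")

results = {text: bool(pattern.fullmatch(text))
           for text in ["a a", "b b", "b a", "a b"]}
for text, matched in results.items():
    print(repr(text), matched)
```

Note that an unanchored search would still find a match inside "b a", because the group may capture the empty string and then the lone space matches on its own; the distinction only shows up when the whole string has to match.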