views:

153

answers:

3

Let's say we have regular expressions:

  • Hello W.*rld
  • Hello World
  • .* World
  • .* W.*

I would like to minimize the number of regexes required to match arbitrary input.

To do that, I need to find if one regular expression matches any input matched by another expression. Is that possible?

Billy3

+1  A: 

Yes.

The problem of equivalence of two regular languages is decidable.

Sketch of an algorithm:

  • minimize both DFAs
  • check if they are isomorph
maxschlepzig
@maxschlepzig: Graph isomorphism is not solvable in polynomial time, so I don't see how that helps.
Billy ONeal
@Billy: I guess your answer is that this is a theoretically solvable problem that isn't practical to solve.
szbalint
@szbalint: Well "theoretically" I could pose every possible input string to the languages and see if they match the same thing. If it's not solvable on reasonable consumer hardware, then there's little point.
Billy ONeal
@Billy ONeal: What have graph isomorphism to do with it? And don't bash theoretical cs. The decidable results are of practical interest. E.g. the problem of equivalence of two context free languages is not decidable.
maxschlepzig
@maxschlepzig: I'm not "bashing theoretical CS" -- but I think it's obvious enough from my question that proving that the problem is merely decidable would not be all that useful of an answer.
Billy ONeal
+2  A: 

Sure!. A regular expression can be represented as an FSM (Finite State Machine) and there are technically infinite number of FSM that can recognize the same string.

Isomorphism is the name that describes if two FSM are equivalent. There are a couple of algorigthm to minimize an FSM. For example the Hopcroft minimization algorithm can minimize two FSM in O(n log n), on an n state automaton.

Dani Cricco
@Dani: Same problem with maxschlepzig's answer. Isomorphism is in class NP.
Billy ONeal
@Billy ONeal: First, (graph) isomorphism is in NP (that's true), but is believed not to be NP-complete, although not in P. However, we are talking about DFA isomorphism, which is a _completely_ different thing.
jpalecek
@jpalecek: How is DFA isomorphism any different? Is a DFA nothing more than a digraph?
Billy ONeal
@Billy: DFA isomorphism is in class P
Dani Cricco
@Billy DFA isomorphism is simpler. You only have one starting point
Dani Cricco
@Billy ONeal: Basically, the biggest difference between a DFA and a graph is that a DFA is labeled, so you know what to match in the first place (initial/terminal states, edges with the same letter). Also, there are DFAs that are isomorphic (we are talking about isomorphism defined by equivalence of suffixes starting from a state, there are other possible definitions of isomorphism) but their graph representation is not.
jpalecek
Just to be fair: To do what you want (which is equivalent to testing regexp equivalence), you need to do a possibly exponential NFA-to-DFA conversion. This problem is maybe not even in NP. However, this doesn't make it impractical; exponential algorithms are used all around. You just have to assess the trade-offs and if this is not what you need, you have to look for other ways to solve your primary problem (which we know nothing about).
jpalecek
+4  A: 

Any regular expression can be linked to a DFA - you can minimize the DFA and since the minimal form is unique, you can decide whether two expressions are equivalent. Dani Cricco pointed out the Hopcroft O(n log n) algorithm. There is another improved algorithm by Hopcroft and Craft which tests the equivalence of two DFAs in O(n).

For a good survey on the matter and an interesting approach to this, I reccomend the paper Testing the Equivalence of Regular Languages, from arXiv.

Later edit: if you are interested in inclusion rather than equivalence for regular expressions, I have come across a paper that might be of interest: Inclusion Problem for Regular Expressions - I have only skimmed through it but it seems to contain a polynomial time algorithm to the problem.

Lawrence
@Lawrence: Hmmm.. interesting. One issue though is that `.*` and `Hello World` are decidedly not equivalent, though `.*` can match anything `Hello World` can match.
Billy ONeal
I'm not sure of the meaning of "match" for you - it seems as though you don't want to test equivalence but rather inclusion. Can you be more exact with your question?
Lawrence
@Lawrence: My difficulty is that I don't know exactly how to describe what I'm looking for -- I'm sorry for the runaround here. I have modified the question slightly -- from Wikipedia's description on set theory inclusion does seem to what I need.
Billy ONeal
@Billy: I think the linked paper I just added is what you're looking for.
Lawrence
@Billy ONeal: If you're OK with an approximation (which you probably are), the last paper in this answer is probably good for you.
jpalecek