tags:

views:

229

answers:

5

Hi,

Im looking for function (PHP will be the best), which returns true whether exists string matches both regexpA and regexpB.

Example 1:

$regexpA = '[0-9]+';
$regexpB = '[0-9]{2,3}';

hasRegularsIntersection($regexpA,$regexpB) returns TRUE because '12' matches both regexps

Example 2:

$regexpA = '[0-9]+';
$regexpB = '[a-z]+';

hasRegularsIntersection($regexpA,$regexpB) returns FALSE because numbers never matches literals.

Thanks for any suggestions how to solve this.

Henry

A: 

Duplicate: http://stackoverflow.com/questions/489095/how-to-determine-if-a-regex-is-orthogonal-to-another-regex

mcandre
His question is not wether a specific string matches, but wether *any* string could *ever* match both. Are you proposing testing every possible string? Because there are some rather obvious practical issues...
Frank Farmer
Not what the OP asked for (although this was my reaction on the first read-through).
Chris
This does look like a duplicate to me. +1
erisco
Duplicate candidates are best noted in comments as they are not really answers
kemp
A: 

Your question is framed as a string-matching problem, but actually refers to the very interesting problem of whether or not it can be proved that two distinct state machines could reach an 'accept' state given some input.

This seems like a fairly theoretical computability problem to my (admittedly untrained) eye. Anyone else have any thoughts? Smells NP-complete...

Chris
+1  A: 

It is possible. I encountered it once with Pellet OWL reasoner while learning semantic web technologies.

Here is an example that shows how regular expressions can be parsed into a tree structure. You could then (in theory) parse your two regular expressions to trees and see if one tree is a subset of the other tree, ie. if one tree can be found in within other tree's nodes.

If it is found, then the other regular expression will match (not only, but also) a subset of what the first regular expression will match.

It is not much of a solution, but maybe it'll help you.

mqchen
This is fun, but ... once you have this tree, what do you do next?
Hamish Grubijan
Looking at the syntax tree of two regular expressions will not tell you whether they're equivalent. Two regexen can have completely different syntax trees and still be equivalent (take `/(1+|0+)+/` and `/0(0|1)*|1(1|0)*/` for example).
sepp2k
Hmm, I'm a little confused. How can both `/(1+|0+)+/` and `/0(0|1)*|1(1|0)*/` match the same string? The first regex requires the string to begin with `1` while the other requires it begin with `0`. (I'm a bit tired right now, so I might be completely off...)
mqchen
@mq.chen: Both regexen are equivalent to `/(1|0)+/` (or `/(0|1)+/`, which is the same thing). Notice the `|` - both regexen can start with either 1 or 0.
sepp2k
+1  A: 

For regular expressions that are actually regular (i.e. don't use irregular features like back references) you can do the following:

  1. Transform the regexen into finite automata (the algorithm for that can be found here(chapter 9) for example).
  2. Build the intersection of the automata (You have a state for each state in the cartesian product of the states of the two automata. You then transition between the states according to the original automata's transition rules. E.g. if you're in state x1y2, you get the input a, the first automaton has a transition x1->x4 for input x and the second automaton has y2->y3, you transition into the state x4y3).
  3. Check whether there's a path from the start state to the end state in the new automaton. If there is, the two regexen intersect, otherwise they don't.
sepp2k
+1  A: 

A regular expression specifies a finite state machine that can recognize a potentially infinite set of strings. The set of strings may be infinite but the number of states must be finite, so you can examine the states one by one.

Taking your second example: In the first expression, to get from state 0 to state 1, the string must start with a digit. In the second expression, to get from state 0 to state 1, the string must start with a letter. So you know that there is no string that will get you from state 0 to state 1 in BOTH expressions. You have the answer.

Taking the first example: You know that if the string starts with a digit you can get from state 0 to state 1 with either regular expression. So now you can eliminate state 0 for each, and just answer the question for each of the two (now one state smaller) finite-state-machines.

This looks a lot like the well-known "halting problem", which as you know is unsolvable in the general case for a Turing machine or equivalent. But in fact the halting problem IS solvable for a finite-state machine, simply because there are a finite number of states.

I believe you could solve this with a non-deterministic FSM. If your regex had only one transition from each state to the next, a deterministic FSM could solve it. But a regular expression means that for instance if you are in state 2, then if the caracter is a digit you go to state 3, else if the character is a letter you go to state 4.

So here's what I would do:

  1. Solve it for the subset of FSM's that have only one transition from one state to the next. For instance a regex that matches both "Bob" and "bob", and a second regex that matches only "bob" and "boB".

  2. See if you can implement the solution in a finite state machine. I think this should be possible. The input to a state is a pair representing a transition for one FSM and a transition for the second one. For instance: State 0: If (r1, r2) is (("B" or "b"), "b") then State 1. State 1: If (r1, r2) is (("o"), ("o")) then state 2. etc.

  3. Now for the more general case, where for instance state two goes back to state two or an earlier state; for example, regex 1 recognizes only "meet" but regex 2 recognizes "meeeet" with an unlimited number of e's. You would have to reduce them to regex 1 recognizing "t" and regex 2 recognizing "t". I think this may be solvable by a non-deterministic FSM.

That's my intuition anyway.

I don't think it's NP-complete, just because my intuition tells me you should be able to shorten each regex by one state with each step.

Mark Lutton
As sepp2k pointed out, this only applies to regexes that really DO recognize only regular expressions.
Mark Lutton