tags:

views:

481

answers:

3

Is there a way to test if a regular expression "contains" another regular expression?
For example:

RegEX1 = "a.*b";
RegEx2 = "a1.*b";

RegEX1 "contains" RegEX2.

As far as I know - this can't be done, am I wrong?


OK, joel.neely has shown that it can be done (haven't read it yet...) academically.

Can it be done in a programming language, say C# ?
How effective will that be? How long will it take to test 1000 pairs?

A: 

Surely applying RegEX1 to RegEx2 would give a match (although that approach would only work in trivial cases like those given)

Rowland Shaw
That would not work in general - RexEx1 would treat RegEx2's metacharacters as normal characters.
Avi
Hence the comment saying it only works in trivial cases, such as that given
Rowland Shaw
+5  A: 

Yes.

This paper contains a detailed discussion of the topic (see section 4.4).

joel.neely
Can you clarify your "yes". I think you are saying "Yes, you are wrong" and citing the paper that shows how it can be done (from a quick look at the paper). But it would be worth spelling that out explicitly.
Jonathan Leffler
The paper mentioned only says "It is a well known result that for two regular expressions B and R, it is readily decidable whether B subsumes R" and then goes on to describe "content models." Also, the paper's method appears to be simply enumerating all strings with length < n (calculated somehow?) and checking whether they are in the second expression but not the first. Decidable, perhaps, but not exactly feasible with 26^n options even without considering case and punctuation.
Clueless
A: 

Converting the two expressions to the equivalent state machines, and checking all paths in both machines allow the same matches, should do the trick. The pumping lemme should obviously be minded, so avoid revisiting old nodes.

It would only work for "simple" regular expressions (or real, what have you, perls recursive expressions are much more expressive).

While a graph of the state machine could have a large number of paths, it should still be limited (esp if the source for the expressions are human). So you'd find all the allowable paths of RegEX1, and check, each in turn, if it's allowable in RegEX2. If all paths are valid, you'd know that the one is contained in the other.

Svend
Is it possible (in a reasonable time) to run a test to get a hierarchy of regular expression (several hundreds of them)? can you provide pointers to code that does that?
Dror
I don't see why it would not possible, and in decent time as well. You'd have to build this from scratch though, which is not trivial.
Svend
checking "all paths are valid" for all pairs would probably take a very long time. "checking all paths in both machines" as you say may be infinite, or am i missing something?
Dror
1. You shouldn't need to revisit nodes due to the pumping lemma2. Not all pairs need to be checked. You find the paths in one, and see if it's allowable in the other. A large class of paths will obviously share parts in common, those need only be checked once, etc.
Svend