views:

82

answers:

3

If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?

That is, the list is valid if and only if for all strings a maximum of one item in the list will match the entire string.

It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.

The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.

+1  A: 

The Wkipedia article on regular expressions does state

It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).

but gives no further hints.

Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.

High Performance Mark
I believe CaptnCraig wants to know if the languages have a non-empty intersection, not whether they're identical.
Jim Lewis
+1  A: 

You can't do that by only looking at the regular expression.

Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to string "11" they produce different results.

The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.

Seth
*"When applied to string "11" they produce different results. "* actually: they don't, they produce the same result. Unless you add anchoring.
Abel
For pure regular expressions, what CaptnCraig asks is possible (but may be inefficient). He wants to know if two regular languages (specified by regular expressions) have a non-empty intersection. For your example the answer is "yes".
Jim Lewis
@Abel: I think what he meant is that the part of the string they match is different. They'll both match though.
Matti Virkkunen
Sorry, my question was bad. Maybe what I meant to ask is that only one matches at position 0.
CaptnCraig
@Jim: how do you find that it is possible to find the intersection of two infinite sets? Can you elaborate? @Matti: indeed, they match different parts (unless you have a non-greedy engine, but that's rare, but it happens) :)
Abel
@CaptnCraig: perhaps you can update your q. with examples? Is "position 0" the same as the first character? Or the first string?
Abel
@Abel: Convert each regex to a DFA, combine the DFAs to produce a DFA recognizing the intersection, then check if the intersection is empty. Regular languages are closed under intersection, and this can be proved in finite time even if the languages are infinite.
Jim Lewis
I was under the impression that the regexes were allowed to overlap, but that given the input strings, they were only allowed to match one of them, and only one of them. While you prove (non)intersection, intersection might be allowed if the rule is based on the input strings. I.e., `k*` and `j*` are good if input is `cat, khaki, joe`, while the regexps do intersect.
Abel
+4  A: 

If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible. What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine them into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, that string is accepted by both input regexen.

The problem with this is that the first step of the usual regex->DFA algorithm is to convert the regex to a NFA, then convert the NFA to a DFA. But that last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for very simple regular expressions.

If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.

Jim Lewis
Intriguing thought. I think you're effectively crossing swords with Jeffrey Friedl, who said (page 157) "Talking about DFA matching is very boring". You just made it interesting again (accept that DFA still limits you highly)!
Abel
Thats what I feared. Very interesting answer though.
CaptnCraig