views:

320

answers:

6

I think that the title accurately summarizes my question, but just to elaborate a bit, instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.

The function doesn't need to generate every string that satisfies the regular expression (cause that would be an infinite number of string for a lot of regexes). Just a sampling of the many valid strings is sufficient.

How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.net) that do this.

Thanks!

A: 

The easiest way to implement but definitely most CPU time intensive approach would be to simply brute force it. Set up a character table with the characters that your string should contain and then just sequentially generate strings and do a Regex.IsMatch on them.

Tom Frey
For a regex of any significant complexity this would probably take an unreasonable amount of processing power.
Rex M
That could easily finish after the end of time.
Jonas Elfström
Please don't do this.
Byron Whitlock
According to the back of my napkin, if you can do 100,000 regex tests per second, you will be able to test all 12-character strings possible on a standard US keyboard by roughly the same time the Sun will burn out.
Rex M
well, I didn't say that it will be fast or practical but it would be easy to implement and for short strings it would be a feasible approach :P
Tom Frey
@Rex M, You might upgrade your cpu before the sun burns out, so make sure the program can be halted+resumed
gnibbler
also, you can easily parallelize it, so e.g. on a dual quad core machine it would only take 1/8th of ~5.5bln years, not accounting for quantum computing in the near future
Tom Frey
+9  A: 

Well a regex is convertible to a DFA which can be thought of as a graph. To generate a string given this DFA-graph you'd just find a path from a start state to an end state. You'd just have to think about how you want to handle cycles (Maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.

Falaina
+1 great answer. [ DFA = Deterministic Finite Autonoma ]
Byron Whitlock
Most modern regex engines supports backreferences and it seems that those could make it really hard to produce a match. http://perl.plover.com/NPC/
Jonas Elfström
http://osteele.com/tools/reanimator/??? - May help you along the path of realizing the conversion to a DFA
gnarf
Another way to look at this would be to convert the DFA to a markov model. That way taking a random walk through the DFA is easy. At each point you randomly choose a transition and you only stop when you reach a terminating state or you have a string of a desired length.A markov chain is very simple to represent as a square reachability matrix from states to states. If a state is reachable from some other state then there will be a non-zero value in their intersection in the matrix. To walk the DFA, just choose randomly from amongst the matches of the row of the current state.
Andrew Matthews
A: 

I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.

All joking aside, I'm not sure that what you are truly going after is feasible. Reg-Ex is a very open, flexible language and giving the computer enough sample input to truly and accurately find what you need, is probably not feasible.

If I'm proven wrong, I wish kudos to that developer.

To look at this from a different perspective, this is almost (not quite) like giving a computer it's output, and having it -- based on that -- write a program for you. This is a little overboard, but it kind of illustrates my point.

Frank V
+2  A: 

This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)

These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.

Darius Bacon
+1  A: 

This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:

[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
@|TH[12]
@(@|TH[12])?
@(@|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
@(@|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
Paul McGuire
+1  A: 

Since it is trivially possible to write a regular expression that matches no possible strings, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need an upper bound on requesting an answer.

Randal Schwartz