Hello. I would like to know if there is software that, given a regex and of course some other constraints like length, produces random text that always matches the given regex. Thanks
Instead of starting from a regexp, you should look into writing a small context-free grammar; this will allow you to easily generate such random text. Unfortunately, I know of no tool that will do it directly for you, so you would need to write a bit of code yourself to actually generate the text. If you have not worked with grammars before, I suggest you read a bit about BNF notation and "compiler compilers" before proceeding...
I'm not aware of any, although it should be possible. The usual approach is to write a grammar instead of a regular expression, and then create functions for each non-terminal that randomly decide which production to expand. If you could post a description of the kinds of strings that you want to generate, and what language you are using, we may be able to get you started.
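To illustrate the "one function per non-terminal" idea, here is a minimal Python sketch. The grammar (an amount with two decimal digits) and all function names are made up for the example, and the 50% expansion probability and the digit limit are arbitrary assumptions:

```python
import random

# Hypothetical grammar, for illustration only:
#   amount -> number "," digit digit
#   number -> digit | digit number
#   digit  -> "0" | "1" | ... | "9"

def digit(rng):
    # digit: pick one of the ten terminal productions at random
    return rng.choice("0123456789")

def number(rng, max_digits=6):
    # number: randomly choose between the two productions;
    # max_digits forces the non-recursive one so the expansion terminates
    if max_digits <= 1 or rng.random() < 0.5:
        return digit(rng)
    return digit(rng) + number(rng, max_digits - 1)

def amount(rng=random):
    # amount: only one production, so just expand it
    return number(rng) + "," + digit(rng) + digit(rng)

if __name__ == "__main__":
    print(amount())
```

Each call to amount() returns a string that the grammar accepts, so it also matches the equivalent regex \d{1,6},\d{2}.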
There's an alternative to using a RegEx: define a placeholder expression consisting of fixed text and substitutable parameters (abc{0}def{1}ghi{2}) and fill in each parameter with a random string drawn from a fixed alphabet, like so:
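using System;
using System.Text;

public static class RandomText
{
    // Shared generator (assumed; the original snippet does not show where
    // `random` is declared).
    private static readonly Random random = new Random();

    public static string GenerateRandomAlphaString(int length)
    {
        const string alpha =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        return GenerateRandomString(length, alpha);
    }

    public static string GenerateRandomString(int length, string alphabet)
    {
        int maxlen = alphabet.Length;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            // Next's upper bound is exclusive, so the index is always valid.
            sb.Append(alphabet[random.Next(0, maxlen)]);
        }
        return sb.ToString();
    }
}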
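The snippet above only produces the random substitutions; filling them into the placeholder expression could look roughly like this in Python. This is a sketch, not a translation of any real code: fill_template and its lengths parameter are hypothetical names introduced for the example.

```python
import random
import string

def random_string(length, alphabet=string.ascii_letters, rng=random):
    # Draw `length` characters independently from the fixed alphabet.
    return "".join(rng.choice(alphabet) for _ in range(length))

def fill_template(template, lengths, rng=random):
    # One random substitution per positional placeholder {0}, {1}, ...
    substitutions = [random_string(n, rng=rng) for n in lengths]
    return template.format(*substitutions)

if __name__ == "__main__":
    print(fill_template("abc{0}def{1}ghi{2}", [3, 4, 5]))
```

The fixed text guarantees the overall shape, so the result always matches a pattern such as abc[A-Za-z]{3}def[A-Za-z]{4}ghi[A-Za-z]{5} without any regex engine being involved.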
We did something similar in Python not too long ago for a RegEx game that we wrote. We had the constraint that the regex had to be randomly generated, and the selected words had to be real words. You can download the completed game EXE here, and the Python source code here.
Here is a snippet:
import random
import re

# gen_regex, words, max_word and the problem module are defined elsewhere
# in the game's source.
def generate_problem(level):
    keep_trying = True
    while keep_trying:
        regex = gen_regex(level)
        # print 'regex = ' + regex
        counter = 0
        match = 0
        notmatch = 0
        goodwords = []
        badwords = []
        num_words = 2 + level * 3
        if num_words > 18:
            num_words = 18
        max_word_length = level + 4
        # Sample real words until there are enough matches and non-matches,
        # giving up on this regex after 10000 attempts.
        while (counter < 10000) and ((match < num_words) or (notmatch < num_words)):
            counter += 1
            rand_word = words[random.randint(0, max_word)]
            if len(rand_word) > max_word_length:
                continue
            mo = re.search(regex, rand_word)
            if mo:
                match += 1
                if len(goodwords) < num_words:
                    goodwords.append(rand_word)
            else:
                notmatch += 1
                if len(badwords) < num_words:
                    badwords.append(rand_word)
        if counter < 10000:
            new_prob = problem.problem()
            new_prob.title = 'Level ' + str(level)
            new_prob.explanation = 'This is a level %d puzzle. ' % level
            new_prob.goodwords = goodwords
            new_prob.badwords = badwords
            new_prob.regex = regex
            keep_trying = False
    return new_prob
Check out the RandExp Ruby gem. It does what you want, though only in a limited fashion. (It won't work with every possible regexp, only regexps which meet some restrictions.)
Well, here is the actual problem: I want to generate random but valid SWIFT MT messages. In fact, as a start I would be OK with generating individual fields of these messages rather than whole messages. These fields mostly follow the format XXX:YYYY:ZZZZZZZ (a very abstract description, but the details are not important), where XXX is given, YYYY belongs to a predefined set of literals and, depending on the YYYY literal, ZZZZZZZ has a specific format that is easy to express. I can express this as a regex, so I was wondering whether I could avoid writing the part that, given a regex for ZZZZZZZ, generates random text matching the ZZZZZZZ format. That format is mostly datetimes, amounts, or another predefined set of literals.
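For a field shaped like this, a table that maps each YYYY literal to a small generator function may be enough, with no regex involved. The Python sketch below assumes made-up qualifiers (DATE, AMNT, CODE) and made-up value formats; none of them are taken from the actual SWIFT MT specification.

```python
import random
from datetime import date, timedelta

def random_date(rng):
    # Hypothetical format: YYYYMMDD, within roughly 30 years of 2000-01-01
    d = date(2000, 1, 1) + timedelta(days=rng.randrange(365 * 30))
    return d.strftime("%Y%m%d")

def random_amount(rng):
    # Hypothetical format: integer part, comma, two decimal digits
    return f"{rng.randrange(10**9)},{rng.randrange(100):02d}"

def random_code(rng):
    # Hypothetical predefined set of literals
    return rng.choice(["AAAA", "BBBB", "CCCC"])

# The YYYY literal decides which generator produces the ZZZZZZZ part.
VALUE_GENERATORS = {
    "DATE": random_date,
    "AMNT": random_amount,
    "CODE": random_code,
}

def random_field(tag, rng=random):
    qualifier = rng.choice(sorted(VALUE_GENERATORS))
    value = VALUE_GENERATORS[qualifier](rng)
    return f"{tag}:{qualifier}:{value}"

if __name__ == "__main__":
    for _ in range(3):
        print(random_field("XXX"))
```

Adding a new YYYY literal then means adding one function and one table entry, and whole messages could be assembled by concatenating fields.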
All regular expressions can be expressed as context-free grammars. And there is a nice algorithm already worked out for producing random sentences of a given length from any CFG. So upconvert the regex to a CFG, apply the algorithm, and wham, you're done.
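A rough Python sketch of that pipeline, under several assumptions: the regex a(b|c)*d was converted to a grammar by hand, the grammar has no empty productions, and the sampling uses a standard counting approach that may or may not be the algorithm linked above. All names are invented for the example.

```python
import random
from functools import lru_cache

# Hand-converted grammar for the regex a(b|c)*d (illustrative only).
# Keys are non-terminals; anything else is a terminal string.
GRAMMAR = {
    "S": [("a", "R", "d"), ("a", "d")],
    "R": [("X", "R"), ("X",)],
    "X": [("b",), ("c",)],
}

@lru_cache(maxsize=None)
def count_symbol(symbol, n):
    # Number of derivations of exactly n characters from one symbol.
    if symbol not in GRAMMAR:
        return 1 if n == len(symbol) else 0
    return sum(count_seq(rhs, n) for rhs in GRAMMAR[symbol])

@lru_cache(maxsize=None)
def count_seq(seq, n):
    # Same count for a sequence of symbols (a production's right-hand side).
    if not seq:
        return 1 if n == 0 else 0
    head, rest = seq[0], seq[1:]
    if not rest:
        return count_symbol(head, n)
    # Every symbol yields at least one character, so keep len(rest) in reserve.
    return sum(count_symbol(head, k) * count_seq(rest, n - k)
               for k in range(1, n - len(rest) + 1))

def sample_symbol(symbol, n, rng):
    if symbol not in GRAMMAR:
        return symbol
    rules = GRAMMAR[symbol]
    # Choose a production in proportion to how many strings it can produce.
    weights = [count_seq(rhs, n) for rhs in rules]
    return sample_seq(rng.choices(rules, weights=weights)[0], n, rng)

def sample_seq(seq, n, rng):
    if not seq:
        return ""
    head, rest = seq[0], seq[1:]
    if not rest:
        return sample_symbol(head, n, rng)
    # Choose how many characters the first symbol gets, again by count.
    splits = range(1, n - len(rest) + 1)
    weights = [count_symbol(head, k) * count_seq(rest, n - k) for k in splits]
    k = rng.choices(splits, weights=weights)[0]
    return sample_symbol(head, k, rng) + sample_seq(rest, n - k, rng)

def random_sentence(n, start="S", rng=random):
    if count_symbol(start, n) == 0:
        raise ValueError("the grammar has no sentence of length %d" % n)
    return sample_symbol(start, n, rng)

if __name__ == "__main__":
    print(random_sentence(6))
```

The counts are over derivations, so the sampling is uniform over strings only when the grammar is unambiguous, as this one is. Converting an arbitrary regex to such a grammar automatically is a separate step that the sketch does not cover.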
You guys are fast. I need a day or two of searching (after work) before I can follow up. I am considering researching the practical feasibility of Jay's proposed solution: seeing how easy it is to upconvert to a CFG, and whether there are any implementations of the algorithm he links to, or any other path from there. If that is not possible, I will go with the RandExp solution that Pistos proposed, although I would prefer a Java solution.
Xeger is capable of doing it:
import nl.flotsam.xeger.Xeger;

// Generate a random string that matches the given regex.
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);