Hello. I would like to know if there is software that, given a regex and of course some other constraints like length, produces random text that always matches the given regex. Thanks

A: 

Instead of starting from a regexp, you should look into writing a small context-free grammar; this will let you generate such random text easily. Unfortunately, I know of no tool that will do it directly for you, so you would need to write a bit of code yourself to actually generate the text. If you have not worked with grammars before, I suggest you read a bit about BNF notation and "compiler compilers" before proceeding...

kasperjj
+1  A: 

I'm not aware of any, although it should be possible. The usual approach is to write a grammar instead of a regular expression, and then create functions for each non-terminal that randomly decide which production to expand. If you could post a description of the kinds of strings that you want to generate, and what language you are using, we may be able to get you started.
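
As a rough illustration of that approach, here is a minimal Python sketch with one function per non-terminal, each picking a production at random. The toy grammar (greeting, salutation, name, punctuation) and all function names are made up for the example and have nothing to do with the asker's actual format.

```python
import random

# Hypothetical toy grammar, invented for illustration:
#   greeting    -> salutation " " name punctuation
#   salutation  -> "hello" | "hi" | "hey"
#   name        -> letter | letter name
#   punctuation -> "." | "!"

def salutation(rng):
    return rng.choice(["hello", "hi", "hey"])

def letter(rng):
    return rng.choice("abcdefghijklmnopqrstuvwxyz")

def name(rng, max_len=8):
    # Randomly choose between the two productions of `name`;
    # max_len caps the recursion so the output length stays bounded.
    if max_len <= 1 or rng.random() < 0.5:
        return letter(rng)
    return letter(rng) + name(rng, max_len - 1)

def punctuation(rng):
    return rng.choice([".", "!"])

def greeting(rng=random):
    # One call per symbol on the right-hand side of the production.
    return salutation(rng) + " " + name(rng) + punctuation(rng)

print(greeting())
```

Every string produced this way is in the language of the grammar by construction, so no matching step is needed afterwards.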

Glomek
A: 

There's an alternative to using a RegEx: define a placeholder expression consisting of fixed text and substitutable parameters (abc{0}def{1}ghi{2}) and generate each substitution as a random string over a fixed alphabet, like so:

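using System;
using System.Text;

// The wrapper class name is arbitrary; it only makes the snippet compile.
public static class RandomText
{
    // One shared generator; a new Random per call can repeat values.
    private static readonly Random random = new Random();

    public static string GenerateRandomAlphaString(int length)
    {
        const string alpha =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

        return GenerateRandomString(length, alpha);
    }

    // Builds a string of `length` characters drawn uniformly from `alphabet`.
    public static string GenerateRandomString(int length, string alphabet)
    {
        StringBuilder sb = new StringBuilder(length);

        for (int i = 0; i < length; i++)
        {
            // The upper bound of Next is exclusive, so every index is valid.
            sb.Append(alphabet[random.Next(0, alphabet.Length)]);
        }

        return sb.ToString();
    }
}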
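
For comparison, the substitution step itself (filling abc{0}def{1}ghi{2} with random strings) might look like this in Python. This is only a sketch: the helper names random_string and fill_template and the chosen lengths are assumptions made for the example.

```python
import random
import string

def random_string(length, alphabet=string.ascii_letters, rng=random):
    # Draw `length` characters uniformly from `alphabet`.
    return "".join(rng.choice(alphabet) for _ in range(length))

def fill_template(template, lengths, rng=random):
    # Generate one random string per placeholder, then substitute them
    # into the template in order ({0}, {1}, {2}, ...).
    parts = [random_string(n, rng=rng) for n in lengths]
    return template.format(*parts)

text = fill_template("abc{0}def{1}ghi{2}", [3, 5, 2])
print(text)
```

The fixed parts of the template are preserved verbatim, so the result always has the shape abc...def...ghi... with letter runs of the requested lengths.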
Mitch Wheat
+1  A: 

We did something similar in Python not too long ago for a RegEx game that we wrote. We had the constraint that the regex had to be randomly generated, and the selected words had to be real words. You can download the completed game EXE here, and the Python source code here.

Here is a snippet:

import random
import re

# gen_regex, words, max_word and the problem module are not shown in this snippet.
def generate_problem(level):
  while True:
    regex = gen_regex(level)
    counter = 0
    match = 0
    notmatch = 0
    goodwords = []
    badwords = []
    # Harder levels ask for more words, capped at 18.
    num_words = min(2 + level * 3, 18)
    max_word_length = level + 4
    # Sample dictionary words until there are enough matches and
    # non-matches, or give up after 10000 attempts.
    while counter < 10000 and (match < num_words or notmatch < num_words):
      counter += 1
      rand_word = words[random.randint(0, max_word)]
      if len(rand_word) > max_word_length:
        continue
      if re.search(regex, rand_word):
        match += 1
        if len(goodwords) < num_words:
          goodwords.append(rand_word)
      else:
        notmatch += 1
        if len(badwords) < num_words:
          badwords.append(rand_word)
    # If the attempt budget ran out, loop again with a fresh regex.
    if counter < 10000:
      new_prob = problem.problem()
      new_prob.title = 'Level ' + str(level)
      new_prob.explanation = 'This is a level %d puzzle. ' % level
      new_prob.goodwords = goodwords
      new_prob.badwords = badwords
      new_prob.regex = regex
      return new_prob
HanClinto
+4  A: 

Check out the RandExp Ruby gem. It does what you want, though only in a limited fashion. (It won't work with every possible regexp, only regexps which meet some restrictions.)

Pistos
It's moved: http://github.com/benburkert/randexp
martin clayton
A: 

Well, here is the actual problem: I want to generate random but valid SWIFT MT messages. In fact, I would be OK if I could generate the fields of these messages separately as a start, rather than whole messages. These fields are mostly in the format XXX:YYYY:ZZZZZZZ (a very abstract description, but the details are not important), where XXX is given, YYYY belongs to a predefined set of literals and, depending on the YYYY literal, ZZZZZZZ has a specific format (easy to express). I can express this as a regex, so I was wondering whether I can avoid writing the part that, given a regex for ZZZZZZZ, generates random text matching the ZZZZZZZ format. This format is mostly datetimes, amounts, or another predefined set of literals.
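
For a field layout like this, one possible shape is a table that maps each YYYY literal to a small generator function, sketched below in Python. Everything concrete here is invented for illustration: the qualifiers DATE, AMNT and CODE, the value formats, and the code list are placeholders and are not real SWIFT definitions.

```python
import random
from datetime import date, timedelta

def random_date(rng):
    # Hypothetical date format: YYYYMMDD.
    d = date(2000, 1, 1) + timedelta(days=rng.randrange(0, 10000))
    return d.strftime("%Y%m%d")

def random_amount(rng):
    # Hypothetical amount format: integer part, comma, two decimals.
    return f"{rng.randrange(0, 10**6)},{rng.randrange(0, 100):02d}"

def random_code(rng):
    # Hypothetical predefined set of literals.
    return rng.choice(["AAAA", "BBBB", "CCCC"])

# The YYYY literal decides which generator produces the ZZZZZZZ part.
GENERATORS = {"DATE": random_date, "AMNT": random_amount, "CODE": random_code}

def random_field(tag, rng=random):
    qualifier = rng.choice(sorted(GENERATORS))
    return f"{tag}:{qualifier}:{GENERATORS[qualifier](rng)}"

print(random_field("XXX"))
```

With only a handful of value formats, hand-written generators like these may be simpler than a general regex-to-text tool, at the cost of keeping the generators and the validating regex in sync by hand.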

Paralife
Could you please post the regexp that you have worked out?
Glomek
+8  A: 

All regular expressions can be expressed as context-free grammars. And there is a nice algorithm already worked out for producing random sentences of a given length from any CFG. So upconvert the regex to a CFG, apply the algorithm, and wham, you're done.
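
To make the idea concrete, here is a Python sketch of one counting-based way to draw a sentence of an exact length from a CFG. It is not necessarily the algorithm referred to above. Assumptions: the grammar has no empty productions and no cycles of unit productions, and the example grammar is a hand conversion of the regex [ab]+c (the symbol names S, R and X are made up). Sampling is uniform over derivations, which equals uniform over strings only when the grammar is unambiguous.

```python
import random
from functools import lru_cache

# Hand-written grammar for the regex [ab]+c (an assumption for this sketch):
#   S -> R "c"      R -> X | X R      X -> "a" | "b"
GRAMMAR = {
    "S": [("R", "c")],
    "R": [("X",), ("X", "R")],
    "X": [("a",), ("b",)],
}

@lru_cache(maxsize=None)
def count_symbol(symbol, n):
    # Number of derivations of length-n strings from one symbol.
    if symbol not in GRAMMAR:                      # terminal
        return 1 if len(symbol) == n else 0
    return sum(count_sequence(p, n) for p in GRAMMAR[symbol])

@lru_cache(maxsize=None)
def count_sequence(seq, n):
    # Number of derivations of length-n strings from a symbol sequence.
    if not seq:
        return 1 if n == 0 else 0
    first, rest = seq[0], seq[1:]
    # Each remaining symbol needs at least one character.
    return sum(count_symbol(first, k) * count_sequence(rest, n - k)
               for k in range(1, n - len(rest) + 1))

def sample_symbol(symbol, n, rng):
    if symbol not in GRAMMAR:
        return symbol
    productions = GRAMMAR[symbol]
    weights = [count_sequence(p, n) for p in productions]
    return sample_sequence(rng.choices(productions, weights=weights)[0], n, rng)

def sample_sequence(seq, n, rng):
    if not seq:
        return ""
    first, rest = seq[0], seq[1:]
    splits = list(range(1, n - len(rest) + 1))
    weights = [count_symbol(first, k) * count_sequence(rest, n - k) for k in splits]
    k = rng.choices(splits, weights=weights)[0]    # length given to `first`
    return sample_symbol(first, k, rng) + sample_sequence(rest, n - k, rng)

def random_sentence(start, n, rng=random):
    if count_symbol(start, n) == 0:
        raise ValueError("no sentence of that length")
    return sample_symbol(start, n, rng)

print(random_sentence("S", 5))
```

The counts act as weights: a production or split point is chosen in proportion to how many derivations it leads to, so dead ends (weight zero) are never taken and no retries are needed.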

Jay Kominek
Any known implementation of the algo? Is this a long shot?
Paralife
I successfully implemented it in Perl years ago, and it saw 'production' use, so I probably did it right. The hardest part of the process was understanding the notation used in the paper. Clear that hurdle and you're golden.
Jay Kominek
If I figure out where the Perl is, I'll cough it up, but don't count on anything.
Jay Kominek
Hm, couldn't recursive matches (Perl has them) and conditionals work together in creating something that isn't even context-free anymore?
Joey
A: 

You guys are fast. I need a day or two (after work) of searching to be able to follow up. I am considering researching the practical feasibility of Jay's proposed solution: seeing how easy it is to upconvert to a CFG and whether there are any implementations of the algorithm he links to, or any other path from there. If that is not possible, then I am going for the RandExp solution that Pistos proposed, although I would prefer a Java solution.

Paralife
Ruby can run under Java: [JRuby](http://jruby.codehaus.org/). So it can still integrate well with whatever Java code you already have.
Pistos
+2  A: 

Xeger is capable of doing it:

String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);
Wilfred Springer