Hello, So not to reinvent the wheel, I would like to know what has already been done about generating random statements from a context-free language (like those produced by yacc, etc.). These grammars are primarily for parsing, but maybe someone has done some generation for testing the parsers? Thanks
There's an ancient but still interesting article here that shows why you need a few more constraints for effective generation of random sentences than you do for parsing -- it also suggests a simple way of providing those extra constraints and gives a complete example program (...in Fortran IV... but, hey, it is over 40 years old...!-).
Unfortunately I am not aware of any more recent work on this subject (or implementations in more modern languages -- but transcoding the Fortran dusty deck to whatever language you like best shouldn't be as hard as coming up with these concepts on one's own;-) -- this is just the kind of already-ancient stuff I perused in actual paper-based libraries back when I was in college, and I'm amazed that the ACM's online search facilities allowed me to locate the reference I vaguely remembered, so rapidly (kudos, ACM!-). I have never done any original research on this subject.
Check out this blog post. Basically, it randomizes the RHS chosen at each rule application.
Generation of a sequence of random tokens like this is sort of strightforward; starting with the goal symbol, pick any random expansion of unfilled nonterminals. You probably want some kine of branch-and-bound search down any branch of generated parse tree so you can control depth/size.
But I don't see a lot of value in testing parsers this way, at least not if your parser generator accepts the context free langauge description directly. This happens when you use a full context free parser generator/tool such as GLR (which is what we use in our program transformation system, DMS) or an Earley parser.
You have another problem: if you want to test a parser, you need to feed it what it want, and surely that isn't tokens. Now you have to generate valid lexemes for the terminal leaves. That's not too hard either, but it you wanted to be purist about this approach you'd write your grammar in scannerless style.
But the last problem is that it is hard to find a context free grammar for most langauges. So now you have to debug your gold grammmar too; how will you do that? Once you get to that stage, you're like to just give up, debug the parser, and get on with your life.
Generation of random subtrees is useful in error recovery, if you can splice in the random tree to fix the source stream. We do a simplified form of this for DMS's parser, which means we have only one piece of error handling machinery.