views:

808

answers:

9

What's a common way of generating sentences from a grammar?

I want an algorithm that's sort of the opposite of a parser. That is, given a formal context-free grammar (say LL), I want to generate an arbitrary sentence that conforms to that grammar. I use sentence here to mean any valid body of text, so it can actually be a whole program (even if it doesn't make any sense—as long as it's syntactically correct).

Example grammar:

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

Example generated program:

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}
A: 

Not an answer, but check the Wikipedia entry on grammar generation: http://en.wikipedia.org/wiki/Context-free_grammar_generation_algorithms

It describes some common algorithms used.

dusoft
That's not what he's trying to do
Bearddo
Bearddo's right—I want the other way around, i.e., to generate a string of text from a grammar.
Mark Cidade
+1  A: 

My first suggestion would be a breadth-first search. Just set up a graph of rules and search through them. You'll start spitting out programs starting from the smallest possible ones, and slowly getting larger. You'll likely find, though, that your grammar will spit out exponentially more programs as they grow in size, and you'll likely not get past 30 or so tokens in a program using a BFS.

The problem with a depth first search is that the second you have a left-recursive rule, your search will get stuck in an infinite loop.

Another big problem is that syntactically correct programs are a long way from semantically correct programs. Generating the latter type is likely completely infeasible in all but the most basic cases.
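To illustrate the breadth-first idea, here's a toy sketch (the grammar and all names are invented for the example, not from the question):

```python
from collections import deque

# Toy grammar: nonterminals map to lists of alternative right-hand sides.
# Each right-hand side is a tuple of symbols; keys are the nonterminals.
GRAMMAR = {
    "S": [("a", "S", "b"), ("ab",)],
}

def bfs_sentences(grammar, start="S", limit=5):
    """Enumerate sentences shortest-derivation-first. Unlike a plain
    depth-first search, BFS doesn't get stuck on recursive rules."""
    results = []
    queue = deque([(start,)])
    while queue and len(results) < limit:
        form = queue.popleft()
        # Expand the first nonterminal in the sentential form.
        for i, sym in enumerate(form):
            if sym in grammar:
                for rhs in grammar[sym]:
                    queue.append(form[:i] + rhs + form[i + 1:])
                break
        else:
            results.append("".join(form))  # all terminals: a full sentence
    return results

print(bfs_sentences(GRAMMAR))  # smallest programs come out first
```

Note how the queue grows with every expansion—that's the exponential blowup in practice.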

Eclipse
A: 

While the idea is nice (I have thought of it many times before), the reality is that without some sample data and/or tons of generator constraints/effort limits, it is quite a big job.

One might just find writing samples by hand easier. :)

leppie
For a lot of applications, a combination of hand-generated sequences and grammar-generated sequences hits the sweet spot. One application I was thinking of is to describe any kind of state machine and then generate a sequence of state transitions.
Mark Cidade
+2  A: 

Off the top of my head:

I'd work recursively (basically the opposite of a recursive descent parser) using some heuristics about what to do with ranges ("(...)": probably pick at random), optionals ("?": see [*] below), and repetitions ("*": Poisson distribution?). Literals ("...") are simply written to the output, and subtokens ("<...>") generate a recursion.

This shouldn't be too hard unless you want to guarantee some sort of complete coverage. Even then, just generating a bunch of data would be a help...


[*] You need to include optionals less than 50% of the time to prevent infinite regress when processing rules like

 nonterm:  otherstuff <nonterm>?

Good catch by plinth.

Likewise with repetitions: use a distribution that converges strongly.


You'll need to parse the input grammar first if it is presented in BNF form, as here. The simplest thing to do would be to use a mapping (name, string), then start with the highest-level token (which you might assume means the first one...).

This gives you:

("program", "<imports> NEWLINE? <namespace>")

("imports", "('import' <identifier> NEWLINE)*")

...

Then you start with "program", hit "<imports>" so you recur... on coming back, hit "NEWLINE?", so throw the dice and write it or not, hit "<namespace>" so recur... on return you're done.
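As a rough sketch of that walk (the rule encoding and all names here are made up for illustration):

```python
import random

# Hypothetical rule table: a nonterminal maps to a list of parts, where
# ("opt", sym) stands for `sym?` and ("rep", sym) stands for `sym*`.
RULES = {
    "program": [("rep", "import"), "body"],
    "import":  ["'import x\n'"],
    "body":    ["'{}\n'"],
}

def emit(symbol, out):
    if isinstance(symbol, tuple):
        kind, inner = symbol
        if kind == "opt":
            if random.random() < 0.4:      # strictly under 50%, per [*]
                emit(inner, out)
        elif kind == "rep":
            while random.random() < 0.4:   # geometric: converges strongly
                emit(inner, out)
    elif symbol.startswith("'"):
        out.append(symbol.strip("'"))      # literal: write it to the output
    else:
        for part in RULES[symbol]:         # nonterminal: recur
            emit(part, out)

out = []
emit("program", out)
print("".join(out))
```

The 0.4 probability is the "throw the dice" part; anything with expected count below the runaway threshold keeps the recursion finite.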


I find myself suspecting that this has been done before. If you just need the output, I'd search the web... Perhaps http://portal.acm.org/citation.cfm?doid=966137.966142, though the huge number of parser generators out there clutter up the search space... Try this paper, too.

BTW-- Your local university probably has online subscriptions to these journals, so you can get them for free by hooking up at the library.

dmckee
+1  A: 

I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming so you could look for a grammar based GP system and see how they handle program generation. I would do a recursive rule generation algorithm like the pseudo-code:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

This assumes that you've read in all of the parts into some data structure. You'd also need to handle the repetitions (randomly generate the number of times they occur) and optional rules (flip a coin to see if they are there or not).


Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.
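A runnable Python translation of the pseudo-code above (the rule encoding and example grammar are invented for illustration):

```python
import random

# Each rule is a list of alternatives; each alternative is a list of parts.
# A part is either ("lit", text) or ("rule", name) -- my own encoding.
RULES = {
    "greeting": [[("lit", "hello "), ("rule", "name")],
                 [("lit", "hi "), ("rule", "name")]],
    "name":     [[("lit", "world")], [("lit", "there")]],
}

def generate_rule(name, out):
    # If the rule has more than one option, pick one at random,
    # then process its parts in order, recursing on sub-rules.
    alternative = random.choice(RULES[name])
    for kind, value in alternative:
        if kind == "lit":
            out.append(value)
        else:
            generate_rule(value, out)

out = []
generate_rule("greeting", out)
print("".join(out))
```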

Bearddo
+1  A: 

The problem you will have is that the recursive nature of the graph is such that you can generate correct sentences of infinite size. You will probably want to do something like set up a hash of node types in your grammar with counts and limits of how many times you're allowing yourself to hit that node. Then depth first search to your heart's content.
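A minimal sketch of that counting idea (Python; the grammar, limits, and names are made up for illustration):

```python
import random

RULES = {
    "expr": [["num"], ["expr", "+", "expr"]],  # recursive on purpose
}
LIMITS = {"expr": 10}  # allow at most 10 expansions of "expr" per sentence

def generate(symbol, counts):
    if symbol not in RULES:
        # Terminals: "num" becomes a random digit, anything else is literal.
        return symbol if symbol != "num" else str(random.randint(0, 9))
    counts[symbol] = counts.get(symbol, 0) + 1
    if counts[symbol] > LIMITS[symbol]:
        rhs = RULES[symbol][0]          # over budget: force the base case
    else:
        rhs = random.choice(RULES[symbol])
    return "".join(generate(s, counts) for s in rhs)

print(generate("expr", {}))
```

Once the count for a node exceeds its limit, only non-recursive alternatives are taken, so the search always bottoms out.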

plinth
Yup. That's why I suggested drawing the number of repeats from a converging function. It's also why my initial suggestion (50/50) for optionals is wrong. Gotta go fix that now...
dmckee
+2  A: 

Your solution should follow the inductive structure of the grammar. How do you generate a random utterance for each of the following?

  • Terminal symbol
  • Nonterminal symbol
  • Sequence of right-hand sides
  • Choice of right-hand sides
  • Star closure of right-hand sides

This will all be much clearer if you write down the data structure you use to represent a grammar. The structure of your set of mutually recursive generator functions will mirror that data structure very closely.

Dealing with infinite recursion is a bit dicey. The easiest way is to generate a stream of utterances and keep a depth cutoff. Or if you're using a lazy language like Haskell you can generate all utterances and peel off as many finite ones as you like (a trickier problem than the original question, but very entertaining).
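For instance, in Python the generator could mirror the grammar's inductive cases one-for-one (the term encoding and names here are my own, with a depth cutoff as described):

```python
import random

# Grammar terms mirror the inductive cases: ("t", text) is a terminal,
# ("n", name) a nonterminal, plus ("seq", ...), ("alt", ...), ("star", term).
GRAMMAR = {
    "list": ("seq", ("t", "("), ("star", ("n", "item")), ("t", ")")),
    "item": ("alt", ("t", "x"), ("n", "list")),
}

def gen(term, depth=0, cutoff=8):
    kind = term[0]
    if kind == "t":                        # terminal symbol
        return term[1]
    if kind == "n":                        # nonterminal symbol
        if depth > cutoff:                 # depth cutoff tames recursion
            return ""
        return gen(GRAMMAR[term[1]], depth + 1, cutoff)
    if kind == "seq":                      # sequence of right-hand sides
        return "".join(gen(t, depth, cutoff) for t in term[1:])
    if kind == "alt":                      # choice of right-hand sides
        return gen(random.choice(term[1:]), depth, cutoff)
    if kind == "star":                     # star closure
        out = []
        while random.random() < 0.4:       # converging repeat count
            out.append(gen(term[1], depth, cutoff))
        return "".join(out)

print(gen(("n", "list")))
```

Each case of the data structure gets exactly one branch of the generator—the mutual recursion mirrors the grammar, as described above.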

Norman Ramsey
+1  A: 

As usual, I’m going to advise against reinventing the wheel. I’ve written one of these for ARM assembler, but I’m on record as regretting it (Software: Practice and Experience April 2007):

“In retrospect, an off-the-shelf expression generator should have been used to generate random ARM assembly instructions for comparison. Instead a Perl script was built incrementally, taking each ARM instruction definition and generating instances. An advantage, however, of the incremental in-house approach was that simple substitutions detected simple bugs, and bug hunting could proceed incrementally.”

I’m afraid I don’t recall what made me change my mind, and I doubt it would be relevant to your specific needs, but I do suggest looking harder for a preexisting solution. It requires less discipline to write stuff like this yourself, but it always takes longer than you expect.

Flash Sheridan
+1  A: 

Here is a Python example using the NLTK:

from random import choice
from nltk import CFG

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs=symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = CFG.fromstring('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

print(' '.join(produce(grammar, grammar.start())))

The example is adapted from the book. The sentences generated are syntactically correct but still total gibberish.

Björn Lindqvist