views:

808

answers:

9

What's a common way of generating sentences from a grammar?

I want an algorithm that's sort of the opposite of a parser. That is, given a formal context-free grammar (say LL), I want to generate an arbitrary sentence that conforms to that grammar. I use sentence here to mean any valid body of text, so it can actually be a whole program (even if it doesn't make any sense—as long as it's syntactically correct).

Example grammar:

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

Example generated program:

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}
A: 

Not an answer, but check the Wikipedia entry on grammar generation: http://en.wikipedia.org/wiki/Context-free_grammar_generation_algorithms

It describes some common algorithms used.

dusoft
That's not what he's trying to do
Bearddo
Bearddo's right—I want the other way around, i.e., to generate a string of text from a grammar.
Mark Cidade
+1  A: 

My first suggestion would be a breadth-first search. Just set up a graph of rules and search through them. You'll start spitting out programs starting from the smallest possible ones, and slowly getting larger. You'll likely find, though, that your grammar will spit out exponentially more programs as they grow in size, and you'll likely not get past 30 or so tokens in a program using a BFS.

The problem with a depth first search is that the second you have a left-recursive rule, your search will get stuck in an infinite loop.

Another big problem is that syntactically correct programs are a long way from semantically correct programs. Generating the latter type is likely completely infeasible in all but the most basic cases.
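To illustrate the breadth-first idea, here's a toy sketch (the grammar and all names are invented for the example, not from the question):

```python
from collections import deque

# Toy grammar: nonterminals map to lists of alternative right-hand sides.
# Each right-hand side is a tuple of symbols; keys are the nonterminals.
GRAMMAR = {
    "S": [("a", "S", "b"), ("ab",)],
}

def bfs_sentences(grammar, start="S", limit=5):
    """Enumerate sentences shortest-derivation-first. Unlike a plain
    depth-first search, BFS doesn't get stuck on recursive rules."""
    results = []
    queue = deque([(start,)])
    while queue and len(results) < limit:
        form = queue.popleft()
        # Expand the first nonterminal in the sentential form.
        for i, sym in enumerate(form):
            if sym in grammar:
                for rhs in grammar[sym]:
                    queue.append(form[:i] + rhs + form[i + 1:])
                break
        else:
            results.append("".join(form))  # all terminals: a full sentence
    return results

print(bfs_sentences(GRAMMAR))  # smallest programs come out first
```

Note how the queue grows with every expansion—that's the exponential blowup in practice.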

Eclipse
A: 

While the idea is nice (I have thought of it many times before), the reality is that without some sample data and/or tons of generator constraints/effort limits, it is quite a big job.

One might just find writing samples by hand easier. :)

leppie
For a lot of applications, a combination of hand-generated sequences and grammar-generated sequences hits the sweet spot. One application I was thinking of is to describe any kind of state machine and then generate a sequence of state transitions.
Mark Cidade
+2  A: 

Off the top of my head:

I'd work recursively (basically the opposite of a recursive descent parser) using some heuristics about what to do with ranges ("(...)": probably pick at random), optionals ("?": see [*] below), and repetitions ("*": Poisson distribution?). Literals ("...") are simply written to the output, and subtokens ("<...>") generate a recursion.

This shouldn't be too hard unless you want to guarantee some sort of complete coverage. Even then, just generating a bunch of data would be a help...


[*] You need to include optionals less than 50% of the time to prevent infinite regress when processing rules like

 nonterm:  otherstuff <nonterm>?

Good catch by plinth.

Likewise with repetitions: use a distribution that converges strongly.


You'll need to parse the input grammar first if it is presented in BNF form, as here. The simplest thing to do would be to use a mapping (name, string), then start with the highest-level token (which you might assume means the first one...).

This gives you:

("program", "<imports> NEWLINE? <namespace>")

("imports", "('import' <identifier> NEWLINE)*")

...

Then you start with "program", hit "<imports>" so you recur... on coming back, hit "NEWLINE?", so throw the dice and write it or not, hit "<namespace>" so recur... on return you're done.
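As a rough sketch of that walk (the rule encoding and all names here are made up for illustration):

```python
import random

# Hypothetical rule table: a nonterminal maps to a list of parts, where
# ("opt", sym) stands for `sym?` and ("rep", sym) stands for `sym*`.
RULES = {
    "program": [("rep", "import"), "body"],
    "import":  ["'import x\n'"],
    "body":    ["'{}\n'"],
}

def emit(symbol, out):
    if isinstance(symbol, tuple):
        kind, inner = symbol
        if kind == "opt":
            if random.random() < 0.4:      # strictly under 50%, per [*]
                emit(inner, out)
        elif kind == "rep":
            while random.random() < 0.4:   # geometric: converges strongly
                emit(inner, out)
    elif symbol.startswith("'"):
        out.append(symbol.strip("'"))      # literal: write it to the output
    else:
        for part in RULES[symbol]:         # nonterminal: recur
            emit(part, out)

out = []
emit("program", out)
print("".join(out))
```

The 0.4 probability is the "throw the dice" part; anything with expected count below the runaway threshold keeps the recursion finite.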


I find myself suspecting that this has been done before. If you just need the output, I'd search the web... Perhaps http://portal.acm.org/citation.cfm?doid=966137.966142, though the huge number of parser generators out there clutter up the search space... Try this paper, too.

BTW-- Your local university probably has online subscriptions to these journals, so you can get them for free by hooking up at the library.

dmckee
+1  A: 

I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming so you could look for a grammar based GP system and see how they handle program generation. I would do a recursive rule generation algorithm like the pseudo-code:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

This assumes that you've read in all of the parts into some data structure. You'd also need to handle the repetitions (randomly generate the number of times they occur) and optional rules (flip a coin to see if they are there or not).


Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.
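A runnable Python translation of the pseudo-code above (the rule encoding and example grammar are invented for illustration):

```python
import random

# Each rule is a list of alternatives; each alternative is a list of parts.
# A part is either ("lit", text) or ("rule", name) -- my own encoding.
RULES = {
    "greeting": [[("lit", "hello "), ("rule", "name")],
                 [("lit", "hi "), ("rule", "name")]],
    "name":     [[("lit", "world")], [("lit", "there")]],
}

def generate_rule(name, out):
    # If the rule has more than one option, pick one at random,
    # then process its parts in order, recursing on sub-rules.
    alternative = random.choice(RULES[name])
    for kind, value in alternative:
        if kind == "lit":
            out.append(value)
        else:
            generate_rule(value, out)

out = []
generate_rule("greeting", out)
print("".join(out))
```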

Bearddo
+1  A: 

The problem you will have is that the recursive nature of the graph is such that you can generate correct sentences of infinite size. You will probably want to do something like set up a hash of node types in your grammar with counts and limits of how many times you're allowing yourself to hit that node. Then depth first search to your heart's content.
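A minimal sketch of that counting idea (Python; the grammar, limits, and names are made up for illustration):

```python
import random

RULES = {
    "expr": [["num"], ["expr", "+", "expr"]],  # recursive on purpose
}
LIMITS = {"expr": 10}  # allow at most 10 expansions of "expr" per sentence

def generate(symbol, counts):
    if symbol not in RULES:
        # Terminals: "num" becomes a random digit, anything else is literal.
        return symbol if symbol != "num" else str(random.randint(0, 9))
    counts[symbol] = counts.get(symbol, 0) + 1
    if counts[symbol] > LIMITS[symbol]:
        rhs = RULES[symbol][0]          # over budget: force the base case
    else:
        rhs = random.choice(RULES[symbol])
    return "".join(generate(s, counts) for s in rhs)

print(generate("expr", {}))
```

Once the count for a node exceeds its limit, only non-recursive alternatives are taken, so the search always bottoms out.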

plinth
Yup. That's why I suggested drawing the number of repeats from a converging function. It's also why my initial suggestion (50/50) for optionals is wrong. Gotta go fix that now...
dmckee
+2  A: 

Your solution should follow the inductive structure of the grammar. How do you generate a random utterance for each of the following?

  • Terminal symbol
  • Nonterminal symbol
  • Sequence of right-hand sides
  • Choice of right-hand sides
  • Star closure of right-hand sides

This will all be much clearer if you write down the data structure you use to represent a grammar. The structure of your set of mutually recursive generator functions will mirror that data structure very closely.

Dealing with infinite recursion is a bit dicey. The easiest way is to generate a stream of utterances and keep a depth cutoff. Or if you're using a lazy language like Haskell you can generate all utterances and peel off as many finite ones as you like (a trickier problem than the original question, but very entertaining).
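For instance, in Python the generator could mirror the grammar's inductive cases one-for-one (the term encoding and names here are my own, with a depth cutoff as described):

```python
import random

# Grammar terms mirror the inductive cases: ("t", text) is a terminal,
# ("n", name) a nonterminal, plus ("seq", ...), ("alt", ...), ("star", term).
GRAMMAR = {
    "list": ("seq", ("t", "("), ("star", ("n", "item")), ("t", ")")),
    "item": ("alt", ("t", "x"), ("n", "list")),
}

def gen(term, depth=0, cutoff=8):
    kind = term[0]
    if kind == "t":                        # terminal symbol
        return term[1]
    if kind == "n":                        # nonterminal symbol
        if depth > cutoff:                 # depth cutoff tames recursion
            return ""
        return gen(GRAMMAR[term[1]], depth + 1, cutoff)
    if kind == "seq":                      # sequence of right-hand sides
        return "".join(gen(t, depth, cutoff) for t in term[1:])
    if kind == "alt":                      # choice of right-hand sides
        return gen(random.choice(term[1:]), depth, cutoff)
    if kind == "star":                     # star closure
        out = []
        while random.random() < 0.4:       # converging repeat count
            out.append(gen(term[1], depth, cutoff))
        return "".join(out)

print(gen(("n", "list")))
```

Each case of the data structure gets exactly one branch of the generator—the mutual recursion mirrors the grammar, as described above.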

Norman Ramsey
+1  A: 

As usual, I’m going to advise against reinventing the wheel. I’ve written one of these for ARM assembler, but I’m on record as regretting it (Software: Practice and Experience April 2007):

“In retrospect, an off-the-shelf expression generator should have been used to generate random ARM assembly instructions for comparison. Instead a Perl script was built incrementally, taking each ARM instruction definition and generating instances. An advantage, however, of the incremental in-house approach was that simple substitutions detected simple bugs, and bug hunting could proceed incrementally.”

I’m afraid I don’t recall what made me change my mind, and I doubt it would be relevant to your specific needs, but I do suggest looking harder for a preexisting solution. It requires less discipline to write stuff like this yourself, but it always takes longer than you expect.

Flash Sheridan
+1  A: 

Here is a Python example using the NLTK:

from random import choice
from nltk import CFG

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs=symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = CFG.fromstring('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

print(' '.join(produce(grammar, grammar.start())))

The example is adapted from the book. The sentences generated are syntactically correct but still total gibberish.

Björn Lindqvist