I'm looking for tools for generating random but realistic text. I've implemented a Markov Chain text generator myself and while the results were promising, my attempts at improving them haven't yielded any great successes.

I'd be happy with tools that consume a corpus or that operate on a context-sensitive or context-free grammar. I'd like the tool to be suitable for inclusion in another project. Most of my recent work has been in Java, so a tool in that language is preferred, but I'd be OK with C#, C, C++, or even JavaScript.

This is similar to this question, but larger in scope.

A: 

Something like this Lorem ipsum generator? There are links to several APIs as well.

BalusC
Very similar, but I'm looking for one that can consume a corpus of text and generate random but similar text. I apologize, I should have been more clear in the question.
Carl Summers
+2  A: 

Extending your own Markov chain generator is probably your best bet, if you want "random" text. Generating something that has context is an open research problem.

Try (if you haven't):

  • Tokenising punctuation separately, or including punctuation in your chain, if you're not already. This includes paragraph marks.
  • Falling back to a 1-history chain when you encounter a full stop or newline, if you're using a 2- or 3-history Markov chain.
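
The two suggestions above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name `MarkovSketch` and its methods are hypothetical, it uses a 2-history chain with a 1-history fallback, it treats only ".", "!" and "?" as sentence ends, and it leaves out paragraph marks for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a 2-history Markov chain that drops back to
// 1-history right after a sentence-ending token.
public class MarkovSketch {
    // A token is either a run of letters/digits/apostrophes or one punctuation mark.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}']+|[^\\s\\p{L}\\p{N}']");

    private final Map<String, List<String>> order1 = new HashMap<>();
    private final Map<String, List<String>> order2 = new HashMap<>();
    private final Random random;

    public MarkovSketch(Random random) {
        this.random = random;
    }

    // Splits text so that punctuation marks become tokens of their own.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Assumption: only these three marks end a sentence.
    public static boolean endsSentence(String token) {
        return token.equals(".") || token.equals("!") || token.equals("?");
    }

    // Records successors for both one-token and two-token histories.
    public void train(String corpus) {
        List<String> t = tokenize(corpus);
        for (int i = 0; i + 1 < t.size(); i++) {
            order1.computeIfAbsent(t.get(i), k -> new ArrayList<>())
                  .add(t.get(i + 1));
            if (i + 2 < t.size()) {
                String key = t.get(i) + " " + t.get(i + 1);
                order2.computeIfAbsent(key, k -> new ArrayList<>())
                      .add(t.get(i + 2));
            }
        }
    }

    // Generates up to maxTokens tokens, beginning with the given start token.
    public List<String> generate(String start, int maxTokens) {
        List<String> out = new ArrayList<>();
        out.add(start);
        while (out.size() < maxTokens) {
            String last = out.get(out.size() - 1);
            List<String> candidates = null;
            // Use two tokens of history unless a sentence has just ended.
            if (out.size() >= 2 && !endsSentence(last)) {
                candidates = order2.get(out.get(out.size() - 2) + " " + last);
            }
            // Fall back to one token of history.
            if (candidates == null) {
                candidates = order1.get(last);
            }
            if (candidates == null) {
                break; // No known successor: stop early.
            }
            out.add(candidates.get(random.nextInt(candidates.size())));
        }
        return out;
    }
}
```

To try it, construct `new MarkovSketch(new Random())`, call `train` with your corpus, then call `generate` with a start token that occurs in the corpus. Storing successors in lists with duplicates keeps the sampling proportional to how often each successor was seen.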


Alternatively, you could use WordNet in two passes with your corpus:

  1. Analyse sentences to determine common sequences of word types, i.e. nouns, verbs, adjectives, and adverbs. WordNet includes these. Everything else (pronouns, conjunctions, whatever) is excluded, but you could essentially pass those straight through. This would turn "The quick brown fox jumps over the lazy dog" into "The [adjective] [adjective] [noun] [verb(s)] over the [adjective] [noun]".
  2. Reproduce sentences by randomly choosing a template sentence and replacing the [adjective], [noun] and [verb] placeholders with actual adjectives, nouns and verbs.

There are quite a few problems with this approach too: for example, you need context from the surrounding words to know which homonym to choose. Looking up "quick" in WordNet yields the senses about being fast, but also the one about the bit of your fingernail.
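
As a rough sketch of the two-pass idea, the code below builds a template from a sentence and then refills it. The class `TemplateSketch` is hypothetical, and a hand-written word-to-tag map stands in for a real WordNet lookup; it assigns one tag per word, so it sidesteps the homonym problem rather than solving it, and it assumes input that is already split on whitespace with no attached punctuation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of template extraction and refilling.
public class TemplateSketch {
    // Stand-in for a WordNet lookup: lower-case word -> word type.
    private final Map<String, String> lexicon;
    // Reverse index: word type -> words of that type.
    private final Map<String, List<String>> wordsByTag = new HashMap<>();
    private final Random random;

    public TemplateSketch(Map<String, String> lexicon, Random random) {
        this.lexicon = lexicon;
        this.random = random;
        for (Map.Entry<String, String> e : lexicon.entrySet()) {
            wordsByTag.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                      .add(e.getKey());
        }
    }

    // Pass 1: replace known words with [type] slots, pass the rest through.
    public List<String> toTemplate(String sentence) {
        List<String> template = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            String tag = lexicon.get(word.toLowerCase());
            template.add(tag == null ? word : "[" + tag + "]");
        }
        return template;
    }

    // Pass 2: fill each slot with a random word of the matching type.
    public String fill(List<String> template) {
        StringBuilder sb = new StringBuilder();
        for (String slot : template) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            List<String> choices = null;
            if (slot.startsWith("[") && slot.endsWith("]")) {
                choices = wordsByTag.get(slot.substring(1, slot.length() - 1));
            }
            if (choices == null) {
                sb.append(slot); // Literal word, or a slot type we have no words for.
            } else {
                sb.append(choices.get(random.nextInt(choices.size())));
            }
        }
        return sb.toString();
    }
}
```

With a lexicon that tags "quick", "brown" and "lazy" as adjectives, "fox" and "dog" as nouns, and "jumps" as a verb, `toTemplate` turns the fox sentence into a list of literal words and slots, and `fill` could then produce something like "The lazy quick dog jumps over the brown fox". In practice you would collect many templates from the corpus and pick one at random before filling it.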


I know this doesn't meet your requirement for a library or a tool, but it might give you some ideas.

kibibu