views: 913

answers: 5
I'm having trouble with the NLTK under Python, specifically the .generate() method.

generate(self, length=100)

Print random text, generated using a trigram language model.

Parameters:

   * length (int) - The length of text to generate (default=100)

Here is a simplified version of what I am attempting.

import nltk

words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)

This will always generate

Building ngram index...
The quick brown
None

As opposed to building a random phrase out of the words.

Here is my output when I do

print text.generate()

Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None

Again it starts out with the same text, but then varies it. I've also tried using the first chapter of Orwell's 1984. Again, it always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.

What am I doing wrong here?

A: 

Maybe you can sort the tokens array randomly before generating a sentence.

Geo
NLTK uses the context of words to determine their use. For instance, the entire text of 'Moby Dick' is included in NLTK for example purposes, and calling generate on it will produce Melville-sounding sentences. So unless you know something I don't, I assume you don't want to re-sort the words, because the initial context is significant.
James McMahon
You are right. If you shuffle the words you lose the information that trigrams are all about.
Mastermind
A: 

Are you sure that using word_tokenize is the right approach?

This Google Groups page has the example:

>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()

But I've never used nltk, so I can't say whether that works the way you want.

Mark Rushakoff
nltk.corpus.brown.words() is just a collection of words that comes with NLTK. I'm trying to seed the generator with my own words.
James McMahon
Have you compared your own token list with the Brown corpus?
Mastermind
A: 

Your sample corpus is most likely too small. I don't know exactly how NLTK builds its trigram model, but it is common practice to handle the beginnings and ends of sentences in some way. Since there is only one sentence beginning in your corpus, this might be why every generated sentence starts the same way.
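
For comparison, here is a minimal sketch that builds a Text from one of the larger corpora bundled with NLTK; with many distinct sentence beginnings the output varies much more (the exact behaviour of generate() depends on your NLTK version):

import nltk

# Sketch: a much larger corpus gives the trigram model many different
# sentence beginnings to sample from (assumes an NLTK version whose
# Text.generate() prints random text as in the question).
words = nltk.corpus.gutenberg.words('austen-emma.txt')
text = nltk.Text(words)
text.generate()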

Mastermind
Well, that was just a sample for the purposes of SO. My actual sample is larger. So do you need punctuation to delimit sentences?
James McMahon
I thought so, but if you have already tried an entire Orwell chapter (with punctuation, I assume) I guess I was wrong.
Mastermind
A: 

To generate random text, you need to use Markov chains.

Code to do that, from here:

import random

class Markov(object):

  def __init__(self, open_file):
    self.cache = {}
    self.open_file = open_file
    self.words = self.file_to_words()
    self.word_size = len(self.words)
    self.database()


  def file_to_words(self):
    self.open_file.seek(0)
    data = self.open_file.read()
    words = data.split()
    return words


  def triples(self):
    """ Generates triples from the given data string. So if our string were
 "What a lovely day", we'd generate (What, a, lovely) and then
 (a, lovely, day).
    """

    if len(self.words) < 3:
      return

    for i in range(len(self.words) - 2):
      yield (self.words[i], self.words[i+1], self.words[i+2])

  def database(self):
    # Map each word pair (w1, w2) to the list of words that follow it.
    for w1, w2, w3 in self.triples():
      key = (w1, w2)
      if key in self.cache:
        self.cache[key].append(w3)
      else:
        self.cache[key] = [w3]

  def generate_markov_text(self, size=25):
    # Pick a random bigram from the source text as the starting state,
    # then repeatedly choose a random continuation for the current bigram.
    seed = random.randint(0, self.word_size-3)
    seed_word, next_word = self.words[seed], self.words[seed+1]
    w1, w2 = seed_word, next_word
    gen_words = []
    for i in xrange(size):
      gen_words.append(w1)
      w1, w2 = w2, random.choice(self.cache[(w1, w2)])
    gen_words.append(w2)
    return ' '.join(gen_words)
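
For example, the class above could be used like this (a sketch; the file name is hypothetical, and it follows the Python 2 style of the original):

if __name__ == "__main__":
  # Train on a plain-text file and print 25 words of generated text.
  with open('my_corpus.txt') as corpus_file:  # hypothetical file name
    markov = Markov(corpus_file)
    print markov.generate_markov_text(size=25)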

Explanation: Generating pseudo-random text with Markov chains using Python

Lakshman Prasad
+1  A: 

You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.

In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
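
A rough sketch of that preprocessing (the file path is hypothetical; how each per-sentence token list is fed to your Markov model depends on the implementation):

import nltk

# Hypothetical path to the raw text of the chapter.
raw = open('orwell_1984_ch1.txt').read()

# Sentence tokenization first, then word tokenization: the result is a
# list of lists of tokens, one inner list per sentence.
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]

# Each inner list can now be passed to the Markov model as a separate
# training sequence, so that sentence starts are sampled correctly.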

Ranieri