ansaurus

Question

Generating random sentences from custom text in Python's NLTK?

Answer 1

A:

Maybe you can sort the tokens array randomly before generating a sentence.

Geo 2009-07-19 15:47:09

NLTK uses the context of words to determine their use. For instance, it ships with the entire text of 'Moby Dick' for example purposes, and using generate with that will produce Melville-sounding sentences. So unless you know something I don't, I assume you don't want to re-sort the words, because the original context is significant.

James McMahon 2009-07-19 15:56:49

You are right. If you shuffle the words, you lose the information that trigrams are all about.
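
A small illustration of this point, using only the standard library rather than NLTK. The sample sentence and the helper name `trigrams` are made up for this sketch. Shuffling keeps the same bag of words but typically breaks up most of the three-word sequences a trigram model would learn from:

```python
import random
from collections import Counter

def trigrams(tokens):
    """Return the list of consecutive three-word tuples in tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

# Hypothetical sample text; any tokenized corpus would do.
tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()

shuffled = tokens[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

original = set(trigrams(tokens))
after_shuffle = set(trigrams(shuffled))

# The word counts are unchanged, but the trigram context usually is not.
print(Counter(tokens) == Counter(shuffled))
print(len(original & after_shuffle), "of", len(original), "trigrams survive")
```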

Mastermind 2009-07-20 17:07:50

Answer 2

A:

Are you sure that using word_tokenize is the right approach?

This Google groups page has the example:

>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()

But I've never used nltk, so I can't say whether that works the way you want.

Mark Rushakoff 2009-07-19 16:07:12

nltk.corpus.brown.words() is just a collection of words that comes with NLTK. I'm trying to seed the generator with my own words.

James McMahon 2009-07-19 20:34:47

Have you compared your own token list with the Brown corpus?

Mastermind 2009-07-20 17:06:47

Answer 3

A:

Your sample corpus is most likely too small. I don't know exactly how NLTK builds its trigram model, but it is common practice to handle the beginnings and ends of sentences specially. Since there is only one beginning of a sentence in your corpus, this might be why every generated sentence has the same beginning.

Mastermind 2009-07-19 16:35:15

Well that was a sample for the purposes of SO. My actual sample is larger. So do you need punctuation to offset sentences?

James McMahon 2009-07-19 20:06:48

I thought so, but if you already tried an entire Orwell chapter (with punctuation, I assume), I guess I was wrong.

Mastermind 2009-07-20 17:05:37

Answer 4

A:

To generate random text, you need to use Markov chains.

Code to do that, from here:

import random

class Markov(object):

  def __init__(self, open_file):
    self.cache = {}
    self.open_file = open_file
    self.words = self.file_to_words()
    self.word_size = len(self.words)
    self.database()


  def file_to_words(self):
    self.open_file.seek(0)
    data = self.open_file.read()
    words = data.split()
    return words


  def triples(self):
    """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
    """

    if len(self.words) < 3:
      return

    for i in range(len(self.words) - 2):
      yield (self.words[i], self.words[i+1], self.words[i+2])

  def database(self):
    for w1, w2, w3 in self.triples():
      key = (w1, w2)
      if key in self.cache:
        self.cache[key].append(w3)
      else:
        self.cache[key] = [w3]

  def generate_markov_text(self, size=25):
    seed = random.randint(0, self.word_size-3)
    seed_word, next_word = self.words[seed], self.words[seed+1]
    w1, w2 = seed_word, next_word
    gen_words = []
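    for i in range(size):  # range also works on Python 3, where xrange is gone
      gen_words.append(w1)
      if (w1, w2) not in self.cache:
        break  # the last word pair of the text may have no successor
      w1, w2 = w2, random.choice(self.cache[(w1, w2)])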
    gen_words.append(w2)
    return ' '.join(gen_words)

Explanation: Generating pseudo random text with Markov chains using Python

Lakshman Prasad 2009-07-20 18:48:31

Answer 5

+1 A:

You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.

In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
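
A minimal sketch of this approach, under some assumptions: a crude regex splitter and `str.split` stand in for NLTK's sentence and word tokenizers (`nltk.sent_tokenize` and `nltk.word_tokenize` could be swapped in), and the `<s>`/`</s>` boundary markers, the sample text, and the function names `split_sentences`, `build_model` and `generate_sentence` are invented for the illustration. Each sentence is padded separately, so the model records several possible sentence starts instead of one:

```python
import random
import re
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def split_sentences(text):
    """Crude stand-in for a real sentence tokenizer: split after . ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def build_model(sentences):
    """Map each (w1, w2) pair to the list of words seen after it."""
    model = defaultdict(list)
    for sentence in sentences:
        # Pad every sentence on its own so starts and ends are modelled.
        words = [START, START] + sentence.split() + [END]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)].append(w3)
    return model

def generate_sentence(model, rng=random, max_words=30):
    """Walk the chain from the start state until END or max_words."""
    w1, w2 = START, START
    out = []
    while len(out) < max_words:
        w3 = rng.choice(model[(w1, w2)])
        if w3 == END:
            break
        out.append(w3)
        w1, w2 = w2, w3
    return " ".join(out)

# Hypothetical sample text; a real corpus would be far larger.
text = ("It was a bright cold day in April. "
        "The clocks were striking thirteen. "
        "Winston Smith slipped quickly through the glass doors.")
model = build_model(split_sentences(text))
print(generate_sentence(model))
```

With a corpus this small every generated sentence simply reproduces one of the inputs, but the first word is now drawn from all observed sentence starts rather than a single one.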

Ranieri 2009-09-26 15:50:57
