views:

212

answers:

3

How would I change the following markov script to treat capitalized and lowercase words as the same?

The entire idea is to help increase the quality of output of my markov text generator.

As it stands, if you plug 99 lowercase sentences into it and 1 capitalized sentence - you almost always find a non-markovized version of the capitalized sentence in the output.

# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov.pl: markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;                    # initial state
while (<>)
{                                        # read each line of input
    foreach (split)
    {
      push(@{$statetab{$w1}{$w2}}, $_);
      ($w1, $w2) = ($w2, $_);        # multiple assignment
    }
}

push(@{$statetab{$w1}{$w2}}, $NONWORD);  # add tail
$w1 = $w2 = $NONWORD;

for ($i = 0; $i < $MAXGEN; $i++) 
{
    $suf = $statetab{$w1}{$w2};      # array reference
    $r = int(rand @$suf);            # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);          # advance chain
}
+4  A: 

Convert all your input to lowercase before processing it?

See the lc function.

mobrule
The generated text would therefore be all lowercase as well, no? For example the two words "Mr. Feldman" would need to be capitalized in the output as well.
darkAsPitch
@darkAsPitch - it does. You would need to recapitalize after processing. Alternatively, you could selectively normalize, say, don't normalize a word in the middle of a sentence that starts with a capital letter, and don't normalize a word at the beginning of a sentence if you can recognize it as a proper noun, etc.
mobrule
The problem is that if you draw an equivalence between to words in the state table, say 'One' and 'one', how does the output formatter know how to capitalize the output? One option would be to leave case alone and classify punctuation marks as words. So, "He ate the cheese. The food was good." would have 9 different words, with only the '.' repeated. You have a lot of flexibility in determining what is a word. It could be individual letters, or even complete sentences. It really depends on how you want to use your output.
daotoad
@daotoad: "leave case alone and classify punctuation marks as words" - I believe that is the real answer I was looking for! You could then easily capitalize sentence heads and known names in another script. Is that the general idea here? I'm quite new to Perl. Now, how do I mark a correct answer for a comment? :P
darkAsPitch
+2  A: 

I think the best bet would be to lowercase (or uppercase) the words as soon as they're input:

while (<>)
{                                        # read each line of input
    lc; # convert $_ to lowercase
    # etc.
}
Nathan Fellman
The generated text would therefore be all lowercase as well, no? For example the two words "Mr. Feldman" would need to be capitalized in the output as well.
darkAsPitch
+5  A: 

Nathan Fellman and mobrule are both suggesting a common practice: Normalization.

It's often simpler to process data so that it conforms to expected norms of content and structure, before doing the actual computation that is the main goal of the program or subroutine.

The Markov chain program was interesting, so I decided to play with it.

Here's a version that allows you to control the number of layers in the Markov chain. By changing $DEPTH you can adjust the order of the simulation.

I broke the code into reusable subroutines. You can modify the normalization rules by changing the normalization routines. You can also generate a chain based on a defined set of values.

The code to generate the multi-layer state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.

The word normalization code really should allow the normalizer to return a list of words to process, rather than just a single word--but I don't feel like fixing it now can return a list of words.. Other things like serializing your processed corpus would be good, and using Getopt::Long for command line switches remain to do. I only did the fun bits.

It was a bit of a challenge for me to write this without using objects--this really felt like a good place to make a Markov generator object. I like objects. But, I decided to keep the code procedural so it would retain the spirit of the original.

Have fun.

#!/usr/bin/perl
use strict;
use warnings;

use IO::Handle;

use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH  = 2;

my %state_table;

process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );


sub process_corpus {
    my $fh    = shift;
    my $depth = shift;
    my $state_table = shift || {};;

    my @history = (NONWORD) x $depth;


    while( my $raw_line = $fh->getline ) {

        my $line = normalize_line($raw_line);
        next unless defined $line;

        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {

            next unless defined $word; 

            add_word_to_table( $state_table, \@history, $word );
            push  @history, $word;
            shift @history;
        }

    }

    add_word_to_table( $state_table, \@history, NONWORD );

    return $state_table;
}

# This was the trickiest to write.
# $node has to be a reference to the slot so that 
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;

    my $node = \$table;

    for( @$history ) {
        $node = \${$node}->{$_};
    }

    push @$$node, $word;

    return 1;
}

# Replace this with anything.
# Return undef to skip a word
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}

# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
    return uc shift;
}


sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];

    my $node = $table;

    unless( @$history ) {

        while( 
            ref $node eq ref {}
                and
            exists $node->{NONWORD()} 
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }

    }

    for (my $i = 0; $i < $MAXGEN; $i++) {

        my $word = get_word( $table, $history );

        last if $word eq NONWORD;
        print "$word\n";

        push @$history, $word;
        shift @$history;
    }

    return $history;
}


sub get_word {
    my $table   = shift;
    my $history = shift;

    for my $step ( @$history ) {
        $table = $table->{$step};
    }

    my $word = $table->[ int rand @$table ];
    return $word;
}

Update: I fixed the above code to handle multiple words coming back from the normalize_word() routine.

To leave case intact and treat punctuation symbols as words, replace normalize_line() and normalize_word():

sub normalize_line {
    return shift;
}

sub normalize_word {
    my $word = shift;

    # Sanitize words to only include letters and ?,.! marks 
    $word =~ s/[^A-Z?.,!]//gi;

    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;

    # return all non-zero length words. 
    return grep length, @words;
}

The other big lurking gotcha is that I used - as the NONWORD character. If you want to include a hyphen as a punctuation symbol, you will need to change the NONWORD constant definition at line 8. Just choose something that can never be a word.

daotoad
THANK YOU daotoad for this massive help. I'm not sure I completely understand it (yet) but I trust this is the right answer! One more question - this does not lowercase (normalize?) all of the output - or does it? I'm testing it out now!
darkAsPitch
Actually it uppercases all input (see normalize_line) and then strips out any non-alphabetic characters (see normalize_word). If you want lowercase output, just change the normalization functions. If you go lowercase, then you'll need to adjust the regex in normalize word to accept lowercase characters.
daotoad
@daotoad: could you PLEASE "leave case alone and classify punctuation marks as words" in your heavily modified example above? I have already marked your answer as correct - so there's no real benefit to yourself :P - but it would save me hours of work that I don't currently have to give! Thanks again - very insightful.
darkAsPitch