Nathan Fellman and mobrule are both suggesting a common practice: Normalization.
It's often simpler to process data so that it conforms to expected norms of content and structure, before doing the actual computation that is the main goal of the program or subroutine.
The Markov chain program was interesting, so I decided to play with it.
Here's a version that allows you to control the number of layers in the Markov chain. By changing $DEPTH
you can adjust the order of the simulation.
I broke the code into reusable subroutines. You can modify the normalization rules by changing the normalization routines. You can also generate a chain based on a defined set of values.
The code to generate the multi-layer state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.
The word normalization code really should allow the normalizer to return a list of words to process, rather than just a single word--but I don't feel like fixing it now can return a list of words.. Other things like serializing your processed corpus would be good, and using Getopt::Long for command line switches remain to do. I only did the fun bits.
It was a bit of a challenge for me to write this without using objects--this really felt like a good place to make a Markov generator object. I like objects. But, I decided to keep the code procedural so it would retain the spirit of the original.
Have fun.
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;
use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH = 2;
my %state_table;
process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );
sub process_corpus {
my $fh = shift;
my $depth = shift;
my $state_table = shift || {};;
my @history = (NONWORD) x $depth;
while( my $raw_line = $fh->getline ) {
my $line = normalize_line($raw_line);
next unless defined $line;
my @words = map normalize_word($_), split /\s+/, $line;
for my $word ( @words ) {
next unless defined $word;
add_word_to_table( $state_table, \@history, $word );
push @history, $word;
shift @history;
}
}
add_word_to_table( $state_table, \@history, NONWORD );
return $state_table;
}
# This was the trickiest to write.
# $node has to be a reference to the slot so that
# autovivified items will be retained in the $table.
sub add_word_to_table {
my $table = shift;
my $history = shift;
my $word = shift;
my $node = \$table;
for( @$history ) {
$node = \${$node}->{$_};
}
push @$$node, $word;
return 1;
}
# Replace this with anything.
# Return undef to skip a word
sub normalize_word {
my $word = shift;
$word =~ s/[^A-Z]//g;
return length $word ? $word : ();
}
# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
return uc shift;
}
sub generate_markov_chain {
my $table = shift;
my $length = shift;
my $history = shift || [];
my $node = $table;
unless( @$history ) {
while(
ref $node eq ref {}
and
exists $node->{NONWORD()}
) {
$node = $node->{NONWORD()};
push @$history, NONWORD;
}
}
for (my $i = 0; $i < $MAXGEN; $i++) {
my $word = get_word( $table, $history );
last if $word eq NONWORD;
print "$word\n";
push @$history, $word;
shift @$history;
}
return $history;
}
sub get_word {
my $table = shift;
my $history = shift;
for my $step ( @$history ) {
$table = $table->{$step};
}
my $word = $table->[ int rand @$table ];
return $word;
}
Update:
I fixed the above code to handle multiple words coming back from the normalize_word()
routine.
To leave case intact and treat punctuation symbols as words, replace normalize_line()
and normalize_word()
:
sub normalize_line {
return shift;
}
sub normalize_word {
my $word = shift;
# Sanitize words to only include letters and ?,.! marks
$word =~ s/[^A-Z?.,!]//gi;
# Break the word into multiple words as needed.
my @words = split /([.?,!])/, $word;
# return all non-zero length words.
return grep length, @words;
}
The other big lurking gotcha is that I used -
as the NONWORD character. If you want to include a hyphen as a punctuation symbol, you will need to change the NONWORD constant definition at line 8. Just choose something that can never be a word.