ansaurus

Question

Perl paragraph n-gram

Answer 1

+1 A:

Use the pos operator

pos SCALAR

Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

and the @- special array

@LAST_MATCH_START

@-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

For example, the program below grabs each pair's second word in its own capture and rewinds the match's position so what was the second word will be the next pair's first word:

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1

Greg Bacon 2010-08-18 21:07:43

+0.4999... another 0.5 would be for the relevant documentation references to explain how this works. :)

Ether 2010-08-18 21:09:28

@Ether, he did link to the docs. Stack Overflow just doesn't display links inside `code` text very eye-catchingly.

cjm 2010-08-18 21:38:28

@cjm: indeed!` `

Ether 2010-08-18 22:14:55

Answer 2

+2 A:

I would use look-ahead to collect everything but the first word. That way, the position advances correctly automatically:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could collect any number of words in $2, instead of using one group per word.

cjm 2010-08-18 21:28:19

Answer 3

+5 A:

While the described task might be interesting to code by hand, would not it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.

Grrrr 2010-08-18 21:35:44

Text::Ngrams did the trick perfectly. The fact that I can get n-grams of any size with minimal effort is helpful as well.

gms8994 2010-08-19 12:43:57

Answer 4

+2 A:

You can do something a little funky with lookaheads:

If I do:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

That expression says to look ahead for two words (and capture them), but consume 1.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

It seems I can generalize this by putting in a variable for count:

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;

Axeman 2010-08-18 21:43:03

+1 for mentioning the repeat count.

Jon Purdy 2010-08-18 22:30:57

Answer 5

+1 A:

Is there any particular reason for doing this using regexes alone? The obvious approach to me would to split the text into an array, then use a pair of nested loops to extract your counts from it. Something along the lines of:

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Fixed $phrase_len loop to prevent use of negative indexes, which was causing incorrect results, per cjm's comment.

Dave Sherohman 2010-08-18 22:13:58

This doesn't handle the edges of the array properly. Note that your output includes phrases like "dog the" and "lazy dog the", which don't actually appear in the text.

cjm 2010-08-19 06:09:44

@cjm: Ack! I obviously didn't examine the output carefully enough. Still, not bad for a two-minute proof-of-concept. I've corrected the `$phrase_len` loop to fix this.

Dave Sherohman 2010-08-19 10:02:55

ansaurus

tags:

views:

answers:

Perl paragraph n-gram

pos SCALAR

@LAST_MATCH_START

@-

related questions