tags:

views:

78

answers:

5

Let's say I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords; I have the following to get single word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

After this is complete, I have a hash that looks like the following:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, so that I can get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every "other" pair; looking like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the one word offset:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
+1  A: 

Use the pos operator

pos SCALAR

Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

and the @- special array

@LAST_MATCH_START

@-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

For example, the program below grabs each pair's second word in its own capture and rewinds the match's position so what was the second word will be the next pair's first word:

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
Greg Bacon
+0.4999... another 0.5 would be for the relevant documentation references to explain how this works. :)
Ether
@Ether, he did link to the docs. Stack Overflow just doesn't display links inside `code` text very eye-catchingly.
cjm
@cjm: indeed!` `
Ether
+2  A: 

I would use look-ahead to collect everything but the first word. That way, the position advances correctly automatically:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could collect any number of words in $2, instead of using one group per word.

cjm
+5  A: 

While the described task might be interesting to code by hand, would not it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.

Grrrr
Text::Ngrams did the trick perfectly. The fact that I can get n-grams of any size with minimal effort is helpful as well.
gms8994
+2  A: 

You can do something a little funky with lookaheads:

If I do:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

That expression says to look ahead for two words (and capture them), but consume 1.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

It seems I can generalize this by putting in a variable for count:

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
Axeman
+1 for mentioning the repeat count.
Jon Purdy
+1  A: 

Is there any particular reason for doing this using regexes alone? The obvious approach to me would to split the text into an array, then use a pair of nested loops to extract your counts from it. Something along the lines of:

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Fixed $phrase_len loop to prevent use of negative indexes, which was causing incorrect results, per cjm's comment.

Dave Sherohman
This doesn't handle the edges of the array properly. Note that your output includes phrases like "dog the" and "lazy dog the", which don't actually appear in the text.
cjm
@cjm: Ack! I obviously didn't examine the output carefully enough. Still, not bad for a two-minute proof-of-concept. I've corrected the `$phrase_len` loop to fix this.
Dave Sherohman