This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:
$str = 'This is Goethe the the the their sentence';
$str =~ s/\b((\w+)(?:\s+\2\b)+)/[\1]/g;
The features I'm about to use are described in either perlre, when they apply at a pattern, or perlop when they affect how the substitution operator does its work.
If you like the /x
flag to add insignificant whitespace and comments:
$str =~ s/
\b
(
(\w+)
(?:
\s+
\2
\b
)+
)
/[\1]/xg;
I don't like that \2
though because I hate counting relative positions. I can use the relative backreferences in Perl 5.10. The \g{-1}
refers to the immediately preceding capture group:
use 5.010;
$str =~ s/
\b
(
(\w+)
(?:
\s+
\g{-1}
\b
)+
)
/[\1]/xg;
Counting isn't all that great either, so I can use labeled matches:
use 5.010;
$str =~ s/
\b
(
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[\1]/xg;
I can label the first capture ($1
) and access its value in %+
later:
use 5.010;
$str =~ s/
\b
(?<dups>
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$+{dups}]/xg;
I shouldn't really need that first capture though since it's really just there to refer to everything that matched. Sadly, it looks like ${^MATCH}
isn't set early enough for me to use it in the replacement side. I think that's a bug. This should work but doesn't:
$str =~ s/
\b
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
/[${^MATCH}]/pgx; # DOESN'T WORK
I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.