ansaurus

Question

How can I highlight consecutive duplicate words with a Perl regular expression?

Answer 1

+7 A:

This works:

$str =~ s/\b((\w+)\s+\2)\b/[$1]/g;

Kip 2010-03-24 04:01:20

Use `$str =~ s/\b((\w+)(?:\s+\2)+)\b/[$1]/g;` to match any number of repetitions.

Eric Strom 2010-03-24 04:24:34

@briandfoy: ...which is precisely what the question asked for, before you changed it. And Eric posted in a comment a version which matches more than one repetition.

Kip 2010-03-24 18:49:07

@brian: How do you know that from the 2 lines of context in the original question?

Jon Seigel 2010-03-24 19:39:55

@brian: I bet the OP knows better than anyone.

Jon Seigel 2010-03-24 19:42:41

@brian: I'm gonna go edit some of your questions, because I know your problems better than you do. brb

Jon Seigel 2010-03-24 19:49:12

@brian: Given that the OP already selected this as the answer it seems unlikely they didn't know what they were asking since they found it useful.

Ahmad Mageed 2010-03-24 19:54:06

@brian: The original question specifically asked for matching "double words", which is quite clear. An answer can be expanded to handle other cases also, but it's not OK to say that it's not right if it doesn't. I don't think that it was your intention, but your editing of the question to match your answer makes you look suspicious.

Guffa 2010-03-24 19:55:03

Discussion here: http://meta.stackoverflow.com/questions/43842/someone-other-than-op-edits-question-then-comments-that-accepted-answer-is-wrong/

Shog9 2010-03-24 20:05:30

@brian I can agree with that, but such users tend to continue asking follow up questions before selecting an answer (or selecting and asking in comments).

Ahmad Mageed 2010-03-24 20:22:00

To defuse the situation, I've removed my comments, explained myself in the meta thread, and made my answer community wiki.

brian d foy 2010-03-24 20:45:44

Answer 2

+2 A:

You can try:

$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...

Regex used: \b(\w+)\s+\1\b

Explanation:

\b: word boundary
\w+: a word
(): to remember the above word
\s+: whitespace
\1: the remembered word

It effectively finds two identical full words separated by whitespace and places [ ] around them.

EDIT:

If you want to preserve the amount of whitespace between the words you can use:

$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
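
To see the difference between the two substitutions, it helps to run both on input with irregular spacing. This is only a minimal sketch: the sample string and the helper names collapse_ws and keep_ws are made up for illustration and are not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical sample: a tab inside "Troy Troy", three spaces inside "the the".
my $input = "Thus joyful Troy\tTroy maintained the   the watch";

# First version: the replacement rebuilds the pair with a single space.
sub collapse_ws {
    my ($s) = @_;
    $s =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;
    return $s;
}

# Edited version: $2 carries the original whitespace into the replacement.
sub keep_ws {
    my ($s) = @_;
    $s =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
    return $s;
}

print collapse_ws($input), "\n";   # the tab and extra spaces become one space
print keep_ws($input), "\n";       # the tab and extra spaces are kept as-is
```

The second capture group shifts nothing in the pattern itself, since \1 still refers to the word; only the replacement changes.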

codaddict 2010-03-24 04:02:45

This doesn't preserve the amount and type of whitespace between the duplicated words, if that matters to the OP.

Kip 2010-03-24 04:07:02

@Kip: You are right. Thanks. I've edited my answer.

codaddict 2010-03-24 04:10:10

This only finds two words repeated. It would be better if it found all repeated words. :)

brian d foy 2010-03-24 17:04:00

Answer 3

+13 A:

This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:

 $str = 'This is Goethe the the the their sentence';

 $str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
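
To check what this substitution does with the sample sentence, it can be wrapped in a small script. This is a sketch rather than part of the original answer: the sub name mark_dups is hypothetical, and the replacement uses $1 because \1 on the replacement side draws a warning under `use warnings`.

```perl
use strict;
use warnings;

# mark_dups is a made-up helper name used only for this illustration.
sub mark_dups {
    my ($s) = @_;
    # (\w+) captures a word; (?:\s+\2\b)+ requires one or more repeats of it.
    $s =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
    return $s;
}

print mark_dups('This is Goethe the the the their sentence'), "\n";
# prints: This is Goethe [the the the] their sentence
```

The leading \b keeps the match from starting at the "the" inside "Goethe", and the \b after \2 keeps "their" from counting as another repeat.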

The features I'm about to use are described in perlre when they apply to the pattern, or in perlop when they affect how the substitution operator does its work.

If you like the /x flag to add insignificant whitespace and comments:

 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \2
          \b
         )+
      )
     /[$1]/xg;

I don't like that \2, though, because I hate counting capture groups from the start of the pattern. I can use the relative backreferences available since Perl 5.10. The \g{-1} refers to the immediately preceding capture group:

 use 5.010;
 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \g{-1}
          \b
         )+
      )
     /[$1]/xg;

Counting backward isn't all that great either, so I can use labeled matches (named captures):

 use 5.010;
 $str =~ s/
      \b
      (
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$1]/xg;

I can label the first capture ($1) and access its value in %+ later:

 use 5.010;
 $str =~ s/
      \b
      (?<dups>
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$+{dups}]/xg;
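
Named captures are also handy outside a substitution. As a sketch under assumptions (the sub name find_dups and the layout of the returned hashes are invented for this example), the same pattern can drive a while loop with /g and report each run of repeats through %+:

```perl
use 5.010;
use strict;
use warnings;

# find_dups is a hypothetical helper: it returns one hashref per run of repeats.
sub find_dups {
    my ($s) = @_;
    my @found;
    while ($s =~ /
            \b
            (?<dups>
               (?<word>\w+)
               (?: \s+ \k<word> \b )+
            )
        /xg) {
        # Copy the values out of %+ right away; the next match overwrites them.
        push @found, { word => $+{word}, text => $+{dups} };
    }
    return @found;
}

say "$_->{word}: '$_->{text}'" for find_dups('Troy Troy kept the the the watch');
```

In scalar context, /g resumes from where the previous match ended, so each pass through the loop sees the next run of duplicates.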

I shouldn't really need that first capture, though, since it's only there to refer to everything that matched. Sadly, it looks like ${^MATCH} (used with the /p flag) isn't set early enough for me to use it on the replacement side. I think that's a bug. This should work but doesn't:

 $str =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[${^MATCH}]/pgx;   # DOESN'T WORK

I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.

brian d foy 2010-03-24 17:02:43

+1 for finding a bug in perl.

Kevin Panko 2010-03-24 18:05:46
