I want a Perl regular expression that will match duplicated words in a string.

Given the following input:

$str = "Thus joyful Troy Troy maintained the the watch of night...";

I would like the following output:

Thus joyful [Troy Troy] maintained [the the] watch of night...
+7  A: 

This works:

$str =~ s/\b((\w+)\s+\2)\b/[$1]/g;
Kip
`$str =~ s/\b((\w+)(?:\s+\2)+)\b/[\1]/g;` to match any number of repetitions
Eric Strom
@briandfoy: ...which is precisely what the question asked for, before you changed it. And Eric posted in a comment a version which matches more than one repetition.
Kip
@brian: How do you know that from the 2 lines of context in the original question?
Jon Seigel
@brian: I bet the OP knows better than anyone.
Jon Seigel
@brian: I'm gonna go edit some of your questions, because I know your problems better than you do. brb
Jon Seigel
@brian: Given that the OP already selected this as the answer it seems unlikely they didn't know what they were asking since they found it useful.
Ahmad Mageed
@brian: The original question specifically asked for matching "double words", which is quite clear. An answer can be expanded to handle other cases also, but it's not OK to say that it's not right if it doesn't. I don't think that it was your intention, but your editing of the question to match your answer makes you look suspicious.
Guffa
Discussion here: http://meta.stackoverflow.com/questions/43842/someone-other-than-op-edits-question-then-comments-that-accepted-answer-is-wrong/
Shog9
@brian I can agree with that, but such users tend to continue asking follow up questions before selecting an answer (or selecting and asking in comments).
Ahmad Mageed
To defuse the situation, I've removed my comments, explained myself in the meta thread, and made my answer community wiki.
brian d foy
+2  A: 

You can try:

$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...

Regex used: \b(\w+)\s+\1\b

Explanation:

  • \b: word boundary
  • \w+: a word
  • (): to remember the above word
  • \s+: whitespace
  • \1: the remembered word

It effectively finds two identical whole words separated by whitespace and places [ ] around them.
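
As a rough sketch, the pieces listed above can be laid out one per line with the /x flag so each part carries its own comment. The variable name $marked is just an illustrative choice; the pattern and replacement are the ones from this answer.

```perl
use strict;
use warnings;

my $str = "Thus joyful Troy Troy maintained the the watch of night...";

# Same pattern as above, spread out with /x so each piece can be commented.
# $marked is a hypothetical name for a copy, so $str itself stays untouched.
(my $marked = $str) =~ s{
    \b        # word boundary: start at the beginning of a word
    (\w+)     # a word, remembered as capture group 1
    \s+       # whitespace between the two copies
    \1        # the remembered word again
    \b        # word boundary: the second copy must end here
}{[$1 $1]}gx;

print "$marked\n";   # Thus joyful [Troy Troy] maintained [the the] watch of night...
```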

EDIT:

If you want to preserve the amount of whitespace between the words you can use:

$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
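
To make the difference between the two substitutions visible, here is a small sketch on a made-up input that has a tab and a run of spaces between the duplicates. The input string and the names $collapsed and $preserved are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input: a tab between "Troy Troy", three spaces between "the the".
my $input = "joyful Troy\tTroy maintained the   the watch";

# First version: whatever whitespace was matched is replaced by a single space.
(my $collapsed = $input) =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;

# Second version: capture group 2 holds the whitespace and is put back as-is.
(my $preserved = $input) =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;

print "$collapsed\n";   # joyful [Troy Troy] maintained [the the] watch
print "$preserved\n";   # tab and the three spaces are kept inside the brackets
```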
codaddict
this doesn't preserve the amount and type of whitespace between the duplicated words, if that matters to OP
Kip
@Kip: you are right. Thanks. I've edited my ans.
codaddict
This only finds two words repeated. It would be better if it found all repeated words. :)
brian d foy
+13  A: 

This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:

 $str = 'This is Goethe the the the their sentence';

 $str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
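
A quick way to see what the "one or more" quantifier buys is to run a pair-only pattern and the quantified pattern on the same sentence. This is only a sketch; $pairs and $runs are illustrative names, and each substitution works on its own copy of the string.

```perl
use strict;
use warnings;

my $str = 'This is Goethe the the the their sentence';

# Pair-only pattern: brackets the first two copies and leaves the third outside.
(my $pairs = $str) =~ s/\b((\w+)\s+\2)\b/[$1]/g;

# "One or more" pattern: the whole run lands in a single pair of brackets,
# and the \b after each repeat keeps "their" from counting as another "the".
(my $runs = $str) =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;

print "$pairs\n";   # This is Goethe [the the] the their sentence
print "$runs\n";    # This is Goethe [the the the] their sentence
```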

The features I'm about to use are described in perlre when they apply to the pattern, or in perlop when they affect how the substitution operator does its work.

If you like the /x flag to add insignificant whitespace and comments:

 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \2
          \b
         )+
      )
     /[$1]/xg;

I don't like that \2, though, because I hate counting capture groups to work out their numbers. Perl 5.10 added relative backreferences, so I can use \g{-1} to refer to the immediately preceding capture group:

 use 5.010;
 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \g{-1}
          \b
         )+
      )
     /[$1]/xg;

Counting backward isn't all that great either, so I can use labeled matches (named captures) instead:

 use 5.010;
 $str =~ s/
      \b
      (
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$1]/xg;

I can label the first capture ($1) and access its value in %+ later:

 use 5.010;
 $str =~ s/
      \b
      (?<dups>
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$+{dups}]/xg;
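
The same labels can also be read outside a substitution. As a minimal sketch, a plain match in a while loop with /g exposes both names through %+ on each iteration; the @found array and the output format here are my own illustrative choices.

```perl
use 5.010;
use strict;
use warnings;

my $str = 'This is Goethe the the the their sentence';

# Walk the matches with //g in scalar context and read the named captures.
# @found is a hypothetical collector, used only to show what %+ contains.
my @found;
while ( $str =~ /\b(?<dups>(?<word>\w+)(?:\s+\k<word>\b)+)/g ) {
    push @found, "$+{word}: $+{dups}";
}

say for @found;   # the: the the the
```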

I shouldn't really need that first capture, though, since it's only there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it on the replacement side. I think that's a bug. This should work but doesn't:

 $str =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[${^MATCH}]/pgx;   # DOESN'T WORK

I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.
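
In the meantime, one possible way to drop the outer capture without relying on ${^MATCH} is the older $& variable, which holds the text of the last successful match and can be used in the replacement. This is a sketch of an alternative, not what the code above was aiming for, and perlvar documents a performance cost for $& on older perls, so treat it as a workaround under that assumption.

```perl
use 5.010;
use strict;
use warnings;

my $str = 'This is Goethe the the the their sentence';

# No outer capture group: $& stands for the entire matched run.
$str =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[$&]/xg;

say $str;   # This is Goethe [the the the] their sentence
```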

brian d foy
+1 for finding a bug in perl.
Kevin Panko