A friend asked me this and I was stumped: Is there a way to craft a regular expression that matches a sequence of the same character? E.g., match on 'aaa', 'bbb', but not 'abc'?

m|\w{2,3}| wouldn't do the trick as it would match 'abc'.

m|a{2,3}| wouldn't do the trick as it wouldn't match 'bbb', 'ccc', etc.
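
To see both failures side by side, here is a small sketch that runs the two patterns above against the three sample strings. The %result hash and its key names are made up for illustration only.

```perl
use strict;
use warnings;

# Try both patterns from the question against each sample string.
my @samples = qw(aaa bbb abc);
my %result;
for my $s (@samples) {
    $result{$s} = {
        any_word => ($s =~ m|\w{2,3}| ? 1 : 0),  # any 2-3 word characters
        only_a   => ($s =~ m|a{2,3}|  ? 1 : 0),  # 2-3 copies of 'a' only
    };
    printf "%-3s  \\w{2,3}: %d  a{2,3}: %d\n",
        $s, $result{$s}{any_word}, $result{$s}{only_a};
}
```

The first pattern accepts all three strings, including 'abc'; the second accepts only 'aaa'.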

A: 

Answering my own question, but got it:

m|(\w)\1+|

Bill
\W is the opposite of what you want, isn't it?
Telemachus
Telemachus is right, this will not match the examples you gave in the question.
gpojd
Also, it is better not to use pipes (or any other non-default delimiters) for the regular expression unless you have a reason to.
Pat
+16  A: 

Sure thing! Grouping and references are your friends:

(.)\1+

This will match 2 or more occurrences of the same character. For word-constituent characters only, use \w instead of ., i.e.:

(\w)\1+
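
As a minimal sketch (not part of the original answer), the (.) form can be used to pull every repeated run out of a longer string. The helper name find_runs is hypothetical, and the /p modifier with ${^MATCH} assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Return every run of two or more identical characters in a string.
# find_runs is a hypothetical helper name, not something from the thread.
sub find_runs {
    my ($text) = @_;
    my @runs;
    # (.) captures one character; \1+ requires that same character again,
    # one or more times. With /p, ${^MATCH} holds the whole matched run.
    while ($text =~ /(.)\1+/gp) {
        push @runs, ${^MATCH};
    }
    return @runs;
}

print join(',', find_runs('aaa bbb abc ### xyyz')), "\n";
# prints: aaa,bbb,###,yy
```

Swapping (.) for (\w) in the same loop would skip the '###' run, since # is not a word character.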
David Hanak
This will only match some chars, and miss ones like '###'. The examples he gave were alphabetic chars, but the question doesn't really ask for only alphabetic ones. I'd replace '\w' with '.'.
gpojd
Well, based on the non-operational examples the questioner gave, I assumed s/he wanted to match alphabetic characters only. I should have expressed this in the explanation though.
David Hanak
+1  A: 

This is what backreferences are for.

m/(\w)\1\1/

will do the trick.

friedo
This would not match 'aa'.
gpojd
+2  A: 

This will match more than \w would, like @@@:

/(.)\1+/
gpojd
This is the right one, for "a sequence of the same character", and not just the "aaa", "bbb" examples. +1
Axeman
+9  A: 

Note that in Perl 5.10 we have alternative notations for backreferences as well.

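use 5.010;  # requires Perl 5.10+ and enables say

foreach (qw(aaa bbb abc)) {
  say;
  say ' original' if /(\w)\1+/;               # classic backreference
  say ' new way'  if /(\w)\g{1}+/;            # absolute, braced
  say ' relative' if /(\w)\g{-1}+/;           # most recently opened group
  say ' named'    if /(?'char'\w)\g{char}+/;  # named group, \g form
  say ' named'    if /(?<char>\w)\k<char>+/;  # named group, \k form
}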
oylenshpeegul
http://perldoc.perl.org/perlre.html or http://perldoc.perl.org/search.html?q=perlre
Brad Gilbert
A: 

This is also possible using pure regular expressions (i.e. those that describe regular languages -- not Perl regexps). Unfortunately, it means a regexp whose length is proportional to the size of the alphabet, e.g.:

(a* + b* + ... + z*)

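Where a...z are the symbols in the finite alphabet and + denotes alternation (Perl's |), not repetition. As written, each a* also matches the empty string and a single character; to require at least two repetitions, as (.)\1+ does, each term would be aaa* instead.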

So Perl regexps, although a superset of pure regular expressions, definitely have their advantages even when you just want to use them for pure regular expressions!
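
To make the size cost concrete, here is a rough sketch that builds such a backreference-free pattern for a given alphabet. The helper name build_run_pattern is hypothetical, and the sketch assumes "at least two copies" is wanted, so each alternative uses {2,} (shorthand for aaa*, which is still a pure regular expression).

```perl
use strict;
use warnings;

# Build a backreference-free pattern for a given alphabet:
# one alternative per symbol, each requiring at least two copies.
# build_run_pattern is a hypothetical helper name.
sub build_run_pattern {
    my @alphabet = @_;
    my $alternatives = join '|', map { quotemeta($_) . '{2,}' } @alphabet;
    return qr/^(?:$alternatives)$/;
}

my $pattern = build_run_pattern('a' .. 'z');
for my $s (qw(aaa bbb abc aa a)) {
    printf "%-3s %s\n", $s, ($s =~ $pattern ? 'match' : 'no match');
}
```

The generated pattern has 26 alternatives for a..z and grows with every symbol added, whereas (.)\1+ stays the same length for any alphabet.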

Edmund