tags:

views:

1355

answers:

4

Is it possible to detect repeated number patterns with a regular expression?

So for example, if I had the following string "034503450345", would it be possible to match the repeated sequence 0345? I have a feeling this is beyond the scope of regex, but I thought I would ask here anyway to see if I have missed something.

+5  A: 

Yes, you can - here's a Python test case

import re
print re.search(r"(\d+).*\1", "8034503450345").group(1)
# Prints 0345

The regular expression says "find some sequence of digits, then any amount of other stuff, then the same sequence again."

On a barely-related note, here's one of my favourite regular expressions - a prime number detector:

import re
for i in range(2, 100):
    if not re.search(r"^(xx+)\1+$", "x"*i):
        print i
RichieHindle
Your prime number detector finds 0 and 1 to be prime :-)
balpha
Any idea why the following example is *only* matching `8` and not `0345`?In [18]: foo = re.search(r"(\d+).*\1", "80345824103452420345")In [19]: foo.groups()Out[19]: ('8',)
Hank Gay
@balpha: Good pont - fixed. 8-)
RichieHindle
@Hank Gay: It'll match the first repeating group it finds, which in that case is "8". Use more "\d"s, or braces, to put a minimum length on it.
RichieHindle
@RichieHindle: Thanks. I had somehow convinced myself (at first) that it was matching the longest possible, not the first.
Hank Gay
+10  A: 

This expression will match one or more repeating groups:

(.+)(?=\1+)


Here is the same expression broken down, (using commenting so it can still be used directly as a regex).

(?x)  # enable regex comment mode
(     # start capturing group
.+    # one or more of any character (excludes newlines by default)
)     # end capturing group
(?=   # begin lookahead
\1+   # match one or more of the first capturing group
)     # end lookahead


To match a specific pattern, change the .+ to that pattern, e.g. \d+ for one or more numbers, or \d{4,} to match 4 or more numbers.

To match a specific number of the pattern, change \1+, e.g to \1{4} for four repetitions.

To allow the repetition to not be next to each other, you can add .*? inside the lookahead.

Peter Boughton
Great explanation +1
ichiban
Good example, explained very well
Mark Withers
Great explanation. Excellent extension. Thanks!! +1
Toto
+6  A: 

Just to add a note to the (correct) answer from RichieHindle:

Note that while Python's regexp implementation (and many others, such as Perl's) can do this, this is no longer a regular expression in the narrow sense of the word.

Your example is not a regular language, hence cannot be handled by a pure regular expression. See e.g. the excellent Wikipedia article for details.

While this is mostly only of academic interest, there are some practical consequences. Real regular expressions can make much better guarantees for maximum runtimes than in this case. So you could get performance problems at some point.

Not to say that it's not a good solution, but you should realize that you're at the limit of what regular expressions (even in extended form) are capable of, and might want to consider other solutions in case of problems.

sleske
Very interesting reading, thanks for that.
Mark Withers
+1  A: 

This is the C# code, that uses the backreference construct to find repeated digits. It will work with 034503450345, 123034503450345, 034503450345345, 232034503450345423. The regex is much easier and clearer to understand.

/// <summary>
/// Assigns repeated digits to repeatedDigits, if the digitSequence matches the pattern
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetRepeatedDigits(string digitSequence, out string repeatedDigits)
{
    repeatedDigits = null;

    string pattern = @"^\d*(?<repeat>\d+)\k<repeat>+\d*$";

    if (Regex.IsMatch(digitSequence, pattern))
    {
        Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        repeatedDigits = r.Match(digitSequence).Result("${repeat}");
        return true;
    }
    else
        return false;
}
Rashmi Pandit
Very nice! I like the use of the named group. Production quality code, commented and ready to be copied. Thanks very much!
Mark Withers
"Ready to be copied" :D .. I like that!!!!
Rashmi Pandit