tags:

views:

232

answers:

7

I have been looking for a regular expression with Google for an hour or so now and can't seem to work this one out :(

If I have a number, say:

2345

and I want to find any other number with the same digits but in a different order, like this:

2345

For example, I match

3245 or 5432 (same digits but different order)

How would I write a regular expression for this?

A: 

The simplest regular expression is just all 24 permutations added up via the or operator:

/2345|3245|5432|.../;

That said, you don't want to solve this with a regex if you can get away with it. A single pass through the two numbers as strings is probably better: 1. Check the string length of both strings - if they're different you're done. 2. Build a hash of all the digits from the number you're matching against. 3. Run through the digits in the number you're checking. If you hit a match in the hash, mark it as used. Keep going until you don't get an unused match in the hash or run out of items.

Epsilon Prime
A: 

I think it's very simple to achieve if you're OK with matching a number that doesn't use all of the digits. E.g. if you have a number 1234 and you accept a match with the number of 1111 to return TRUE;

Let me use PHP for an example as you haven't specified what language you use.

$my_num = 1245;
$my_pattern = '/[' . $my_num . ']{4}/'; // this resolves to pattern: /[1245]{4}/
$my_pattern2 = '/[' . $my_num . ']+/'; // as above but numbers can by of any length

$number1 = 4521;
$match = preg_match($my_pattern, $number1); // will return TRUE

$number2 = 2222444111;
$match2 = preg_match($my_pattern2, $number2); // will return TRUE

$number3 = 888;
$match3 = preg_match($my_pattern, $number3); // will return FALSE
$match4 = preg_match($my_pattern2, $number3); // will return FALSE

Something similar will work in Perl as well.

Michal M
Will find that 1223 matches 1331, no?
PhiLho
I guess he does not want the $number2 (2222444111) to return true.
Shree
PhiLho: no, it won't.
Michal M
+3  A: 

Put the digits of each number in two arrays, sort the arrays, find out if they hold the same digits at the same indices.

RegExes are not the right tool for this task.

PhiLho
+7  A: 

I don't think a regex is appropriate. So here is an idea that is faster than a regex for this situation:

  • check string lengths, if they are different, return false
  • make a hash from the character (digits in your case) to integers for counting
  • loop through the characters of your first string:
    • increment the counter for that character: hash[character]++
  • loop through the characters of the second string:
    • decrement the counter for that character: hash[character]--
    • break if any count is negative (or nonexistent)
  • loop through the entries, making sure each is 0:
    • if all are 0, return true
    • else return false

EDIT: Java Code (I'm using Character for this example, not exactly Unicode friendly, but it's the idea that matters now):

import java.util.*;

public class Test
{
    public boolean isSimilar(String first, String second)
    {
        if(first.length() != second.length()) 
            return false;
        HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
        for(char c : first.toCharArray())
        {
            if(hash.get(c) != null)
            {
                int count = hash.get(c);
                count++;
                hash.put(c, count);
            }
            else
            {
                hash.put(c, 1);
            }
        }
        for(char c : second.toCharArray())
        {
            if(hash.get(c) != null)
            {
                int count = hash.get(c);
                count--;
                if(count < 0)
                    return false;
                hash.put(c, count);
            }
            else
            {
                return false;
            }
        }
        for(Integer i : hash.values())
        {
            if(i.intValue()!=0)
                return false;
        }
        return true;
    }

    public static void main(String ... args)
    {
        //tested to print false
        System.out.println(new Test().isSimilar("23445", "5432"));

        //tested to print true
        System.out.println(new Test().isSimilar("2345", "5432"));
    }
}

This will also work for comparing letters or other character sequences, like "god" and "dog".

geowa4
Could you give me an example in code for this? C# or Java?
Gaz
Why hash, and not just an array of 10 elements?
Svante
@Svante: that would certainly work if you are only using numbers, but using a hash allows for all text--a more general solution.
geowa4
A regex is indeed possible. See Chad's answer.
Steve Wortham
+1  A: 

You could do something like this to ensure the right characters and length

 [2345]{4}

Ensuring they only exist once is trickier and why this is not suited to regexes

(?=.*2.*)(?=.*3.*)(?=.*4.*)(?=.*5.*)[2345]{4}
Chad
Well done. It's not pretty, but in fact you wrote it the same way I was going to. Also you may want to add the ^ and $ assertions. Here it is with test cases: http://regexhero.net/tester/?id=3d1d61a9-b757-4722-814b-4a3b98119114
Steve Wortham
Also note that despite how many lookaheads there are involved in this, it's still quite fast to execute. From my tests, Regex Hero is able to process this regex 100,000 times in about 0.8 seconds.
Steve Wortham
Ah, you can also remove the .* from the end of each lookahead since you're only looking for the first occurrence of each number. So it'd become this: ^(?=.*2)(?=.*3)(?=.*4)(?=.*5)[2345]{4}$
Steve Wortham
does it work in Java/C#?
geowa4
It works in C# for sure. It should work in Java too. The only thing fancy about it is the lookaheads, and most common regex implementations support lookaheads and lookbehinds (except for Javascript).
Steve Wortham
@Steve, nice job shortening it
Chad
yep, does work; gj
geowa4
so i +1'ed this earlier but its not showing up, and i cant vote on it now. so idk
geowa4
@geowa4, no worries.
Chad
+10  A: 

There is an "elegant" way to do it with a single regex:

^(?:2()|3()|4()|5()){4}\1\2\3\4$

will match the digits 2, 3, 4 and 5 in any order. All four are required.

Explanation:

(?:2()|3()|4()|5()) matches one of the numbers 2, 3, 4, or 5. The trick is now that the capturing parentheses match an empty string after matching a number (which always succeeds).

{4} requires that this happens four times.

\1\2\3\4 then requires that all four backreferences have participated in the match - which they do if and only if each number has occurred once. Since \1\2\3\4 matches an empty string, it will always match as long as the previous condition is true.

For five digits, you'd need

^(?:2()|3()|4()|5()|6()){5}\1\2\3\4\5$

etc...

This will work in nearly any regex flavor except JavaScript.

Tim Pietzcker
Wow. That is crazy. +1 for even being able to conceptualize it.
chaos
I got the idea from Jan Goyvaerts'/Steven Levithan's book "Regular Expression Cookbook" (p. 304: Exploiting empty backreferences). Buy it today :)
Tim Pietzcker
As for performance, this regex is quite OK as long as the numbers are unique. So "5432" or "2345" will match in 28 steps of the regex engine, and "2344" will fail in 47 steps.But if your test string is "1111", the regex engine needs 278 steps to verify that `^(?:1()|1()|1()|1()){4}\1\2\3\4$` matches because it has to go through a lot of permutations. Looks like O(n!) which is not the kind of performance characteristic you want.
Tim Pietzcker
Excellent, thanks very much for your help!
Gaz
This seems to be faster overall: ^(?=.*2)(?=.*3)(?=.*4)(?=.*5)[2345]{4}$. I mean, when it matches it's about the same performance as yours. But when you're testing something that isn't going to match, then the lookahead method is substantially faster since it knows to quit as soon as the first lookahead fails.
Steve Wortham
A: 

Regular expressions are not appropriate for this purpose. Here is a Perl script:

#/usr/bin/perl

use strict;
use warnings;

my $src = '2345';
my @test = qw( 3245 5432 5542 1234 12345 );

my $canonical = canonicalize( $src );

for my $candidate ( @test ) {
    next unless $canonical eq canonicalize( $candidate );
    print "$src and $candidate consist of the same digits\n";
}

sub canonicalize { join '', sort split //, $_[0] }

Output:

C:\Temp> ks
2345 and 3245 consist of the same digits
2345 and 5432 consist of the same digits
Sinan Ünür