tags:

views:

1142

answers:

7

I have two strings: the first's value is "catdog" and the second's is "got".

I'm trying to find a regex that tells me if the letters for "got" are in "catdog". I'm particularly looking to avoid the case where there are duplicate letters. For example, I know "got" is a match, however "gott" is not a match because there are not two "t" in "catdog".

EDIT:

Based on Adam's response below this is the C# code I got to work in my solution. Thanks to all those that responded.

Note: I had to convert the char to int and subtract 97 to get the appropriate index for the array. In my case the letters are always lower case.

    private bool CompareParts(string a, string b)
    {

        int[] count1 = new int[26];
        int[] count2 = new int[26];

        foreach (var item in a.ToCharArray())
            count1[(int)item - 97]++;

        foreach (var item in b.ToCharArray())
            count2[(int)item - 97]++;

        for (int i = 0; i < count1.Length; i++)
            if(count2[i] > count1[i])
                return false;

        return true;
    }
A: 

You want a string that matches exact those letters, exactly once. It depends what you're writing the regex in, but it's going to be something like

^[^got]*(g|o|t)[^got]$

If you've got an operator for "exactly one match" that will help.

Charlie Martin
[got] is equivalent to (g|o|t) and much more efficient...
PhiLho
A: 

I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:

/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/

Now, with a little trickery you could do this with a few regexps (example in Perl, untested):

$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.

Sane people would use a loop, of course:

$foo = 'got'
foreach $l (split(//, 'catdog') {
    $foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.

There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.

derobert
+7  A: 

You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:

bool containsParts(string1, string2)
{
    count1 = array of 26 0's
    count2 = array of 26 0's

    // Note: be sure to check for an ignore non-alphabetic characters,
    // and do case conversion if you want to do it case-insensitively
    for each character c in string1:
        count1[c]++
    for each character c in string2:
        count2[c]++

    for each character c in 'a'...'z':
        if count1[c] < count2[c]:
            return false

    return true
}
Adam Rosenfield
Good answer - but I think you have return true and return false reversed, don't you? You can't break early and return success, can you?
Jonathan Leffler
Whoops, goot catch. Fixed now.
Adam Rosenfield
Good answer - there's a more succinct way of doing this though below
BenAlabaster
A: 

Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:

/^
 (?=[^got]*g[^got]*$)
 (?=[^got]*o[^got]*$)
 [^got]*t[^got]*
$/x

This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.

Alan Moore
A: 

@Adam Rosenfield's solution in Python:

from collections import defaultdict

def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c

def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""

    count_string = count(astring)
    count_word   = count(word)

    return all(count_string[c] >= count_word[c] for c in word)
J.F. Sebastian
+3  A: 

Previous suggestions have already been made that perhaps regex isn't the best way to do this and I agree, however, your accepted answer is a little verbose considering what you're trying to achieve and that is test to see if a set of letters is the subset of another set of letters.

Consider the following code which achieves this in a single line of code:

MatchString.ToList().ForEach(Item => Input.Remove(Item));

Which can be used as follows:

public bool IsSubSetOf(string InputString, string MatchString) 
{
  var InputChars = InputString.ToList(); 
  MatchString.ToList().ForEach(Item => InputChars.Remove(Item)); 
  return InputChars.Count == 0;
}

You can then just call this method to verify if it's a subset or not.

What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.

You could take it one step further and put the method into a static class:

using System;
using System.Linq;
using System.Runtime.CompilerServices;

static class extensions
{
    public static bool IsSubSetOf(this string InputString, string MatchString)
    {
        var InputChars = InputString.ToList();
        MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
        return InputChars.Count == 0;
    }
}

which makes your method into an extension of the string object which actually makes thins a lot easier in the long run, because you can now make your calls like so:

Console.WriteLine("gott".IsSubSetOf("catdog"));
BenAlabaster
Your code looks so elegant that I just had to try it. For some reason your method not work for me. I dropped it straight into my project as-is and replaced the method I called.In addition, it took almost twice as long to run as the method I posted above.
Sailing Judo
@Sailing Judo: Really? I cut and pasted it straight from my project into here. In fact cutting it from here and pasting it into a project works just fine too. While LINQ isn't perhaps the fastest solution, it does perform well enough that I wouldn't have discarded it for something quicker...
BenAlabaster
@Sailing Judo: Agreed, mine doesn't perform quite so well, but the difference isn't too large to discount the ease of maintenance. Running a test comparison of 1,000,000,000 iterations, your code results in an average of 10 ticks per iteration, mine comes in at 16...
BenAlabaster
@Sailing Judo: ...so yours is faster, but mine's easier to maintain
BenAlabaster
I agree, I like your code better. But for some reason it did not give me the results I needed. I didn't explore the reason why since I already had a working solution. (Actually, now that I think about it, I might know why. I might have tested backwards...)
Sailing Judo
Anyway... it turns out speed is important in this case. I'm looping through a list of 28,000 records and testing each for a match. The time difference in my test was comparable to your test.. mine ran at 101ms, and yours was 167ms. Make no mistake... I like your solution best, just not for this.
Sailing Judo
Yep.. confirmed. Your code works fine for me. I had accidentally swapped the arguments.
Sailing Judo
Ooh, yeah, if you're looping through many records, then for sure, performance is a big issue. Glad you got a solution though.
BenAlabaster
A: 

The best way to do it with regular expressions is, IMO:

A. Sort the characters in the large string (search space) Thus: turn "catdog" into "acdgot"

B.

  1. Do the same with the string of which you search the characters of: "gott" becomes, eh, "gott"...

  2. Insert ".*" between each of these characters

  3. Use the latter as the regular expression to search in the former.

For example, some Perl code (if you don't mind):

$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if($main =~ /$regexp/) {
    print "Found a match!\n";
} else {
    print "Sorry, no match...\n";
}

This prints:

Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...

Drop one "t" and you get a match.

bart