ansaurus

Question

regular expression for finding parts of a string within another

Answer 1

A:

You want a string that matches exact those letters, exactly once. It depends what you're writing the regex in, but it's going to be something like

^[^got]*(g|o|t)[^got]$

If you've got an operator for "exactly one match" that will help.

Charlie Martin 2008-12-19 05:52:20

[got] is equivalent to (g|o|t) and much more efficient...

PhiLho 2008-12-19 10:15:57

Answer 2

A:

I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:

/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/

Now, with a little trickery you could do this with a few regexps (example in Perl, untested):

$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.

Sane people would use a loop, of course:

$foo = 'got'
foreach $l (split(//, 'catdog') {
    $foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.

There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.

derobert 2008-12-19 05:57:04

Answer 3

+7 A:

You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:

bool containsParts(string1, string2)
{
    count1 = array of 26 0's
    count2 = array of 26 0's

    // Note: be sure to check for an ignore non-alphabetic characters,
    // and do case conversion if you want to do it case-insensitively
    for each character c in string1:
        count1[c]++
    for each character c in string2:
        count2[c]++

    for each character c in 'a'...'z':
        if count1[c] < count2[c]:
            return false

    return true
}

Adam Rosenfield 2008-12-19 06:01:24

Good answer - but I think you have return true and return false reversed, don't you? You can't break early and return success, can you?

Jonathan Leffler 2008-12-19 07:05:53

Whoops, goot catch. Fixed now.

Adam Rosenfield 2008-12-19 16:32:53

Good answer - there's a more succinct way of doing this though below

BenAlabaster 2008-12-19 20:50:48

Answer 4

A:

Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:

/^
 (?=[^got]*g[^got]*$)
 (?=[^got]*o[^got]*$)
 [^got]*t[^got]*
$/x

This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.

Alan Moore 2008-12-19 08:05:25

Answer 5

A:

@Adam Rosenfield's solution in Python:

from collections import defaultdict

def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c

def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""

    count_string = count(astring)
    count_word   = count(word)

    return all(count_string[c] >= count_word[c] for c in word)

J.F. Sebastian 2008-12-19 09:05:02

Answer 6

+3 A:

Previous suggestions have already been made that perhaps regex isn't the best way to do this and I agree, however, your accepted answer is a little verbose considering what you're trying to achieve and that is test to see if a set of letters is the subset of another set of letters.

Consider the following code which achieves this in a single line of code:

MatchString.ToList().ForEach(Item => Input.Remove(Item));

Which can be used as follows:

public bool IsSubSetOf(string InputString, string MatchString) 
{
  var InputChars = InputString.ToList(); 
  MatchString.ToList().ForEach(Item => InputChars.Remove(Item)); 
  return InputChars.Count == 0;
}

You can then just call this method to verify if it's a subset or not.

What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.

You could take it one step further and put the method into a static class:

using System;
using System.Linq;
using System.Runtime.CompilerServices;

static class extensions
{
    public static bool IsSubSetOf(this string InputString, string MatchString)
    {
        var InputChars = InputString.ToList();
        MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
        return InputChars.Count == 0;
    }
}

which makes your method into an extension of the string object which actually makes thins a lot easier in the long run, because you can now make your calls like so:

Console.WriteLine("gott".IsSubSetOf("catdog"));

BenAlabaster 2008-12-19 20:40:04

Your code looks so elegant that I just had to try it. For some reason your method not work for me. I dropped it straight into my project as-is and replaced the method I called.In addition, it took almost twice as long to run as the method I posted above.

Sailing Judo 2008-12-19 21:59:11

@Sailing Judo: Really? I cut and pasted it straight from my project into here. In fact cutting it from here and pasting it into a project works just fine too. While LINQ isn't perhaps the fastest solution, it does perform well enough that I wouldn't have discarded it for something quicker...

BenAlabaster 2008-12-19 22:29:36

@Sailing Judo: Agreed, mine doesn't perform quite so well, but the difference isn't too large to discount the ease of maintenance. Running a test comparison of 1,000,000,000 iterations, your code results in an average of 10 ticks per iteration, mine comes in at 16...

BenAlabaster 2008-12-19 22:58:45

@Sailing Judo: ...so yours is faster, but mine's easier to maintain

BenAlabaster 2008-12-19 22:59:19

I agree, I like your code better. But for some reason it did not give me the results I needed. I didn't explore the reason why since I already had a working solution. (Actually, now that I think about it, I might know why. I might have tested backwards...)

Sailing Judo 2008-12-19 23:56:06

Anyway... it turns out speed is important in this case. I'm looping through a list of 28,000 records and testing each for a match. The time difference in my test was comparable to your test.. mine ran at 101ms, and yours was 167ms. Make no mistake... I like your solution best, just not for this.

Sailing Judo 2008-12-19 23:58:34

Yep.. confirmed. Your code works fine for me. I had accidentally swapped the arguments.

Sailing Judo 2008-12-20 00:15:01

Ooh, yeah, if you're looping through many records, then for sure, performance is a big issue. Glad you got a solution though.

BenAlabaster 2008-12-20 01:29:53

Answer 7

A:

The best way to do it with regular expressions is, IMO:

A. Sort the characters in the large string (search space) Thus: turn "catdog" into "acdgot"

B.

Do the same with the string of which you search the characters of: "gott" becomes, eh, "gott"...
Insert ".*" between each of these characters
Use the latter as the regular expression to search in the former.

For example, some Perl code (if you don't mind):

$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if($main =~ /$regexp/) {
    print "Found a match!\n";
} else {
    print "Sorry, no match...\n";
}

This prints:

Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...

Drop one "t" and you get a match.

bart 2008-12-19 22:44:53

ansaurus

tags:

views:

answers:

regular expression for finding parts of a string within another

related questions