tags:

views:

57

answers:

3

Hi,

I am looking for a way to generate regex from based on the differences between 2 input strings. The regex can then be used to match the 3rd string that is also similar to the 2 input strings.

For example, given 3 strings: f1, f2, and f3

f1:

111xxx222

f2:

111yyy222

f3:

111zzz222

I am wondering if there is a generic way to generate the following regex from f1 and f2 that can be used to extract "zzz" in f3:

111(.*)222
+2  A: 

Assuming the 2 strings are equal length, with Perl you can do:

my @x = split//,'111xxx222';
my @y = split//,'111yyy222';
my $r;
for my $i(0..$#x) {
    $r .= $x[$i] eq $y[$i] ? $x[$i] : '.';
}
$r =~ s/(\.+)/($1)/g;
print $r

output:

111(...)222
M42
A: 

There's probably a much nicer way to do this but, in Python:

import re

f1 = "111xxx222"
f2 = "111yyy222"
f3 = "111zzz222"

pre = []
post = []
found_diff = 0

for i in range(len(f1)):
    if f1[i] == f2[i]:
        if found_diff == 0:
            pre.append(f1[i])
        else:
            post.append(f1[i])
    else:
        found_diff = 1

regex = "".join(pre) + "(.*)" + "".join(post)

re.compile(regex)
print re.match(regex, f3).group(1)

prints zzz.

NB this won't work if the first two strings are not the same length.

Vicky
+2  A: 

Now you have two problems.

You need to think hard about what the set of possible strings are, and what is the "right" regular expression for each input pair. I know it seems obvious to you right now but convincing your algorithm to do what you expect may be harder than you think. A little TDD is always a good idea: try to think of some pathological cases.

In your example, why shouldn't it generate something like one of these, instead of your suggestion (see @M42's answer for the first one)?:

111(...)222
111(...*)222
111(...+)222
111(..*)222
111(...?)222
(.*)111(.*)222(.*)
etc.

What would you like to happen if the third string is one of these? Or the second?

1111zzz222
111zzz2222
0111zzz222
111zzz2223
111zzzz222
111zz222
11zzz222
111zzz22
111zzz
zzz222
111xyz222
1z1z1z222
111222
1111222
1112222
etc.

If (as I suspect) you already know that some of these cases are going to be impossible, then you already know more about the possible structure of the strings than you are letting on. And if that's the case, you should probably just parse the strings and have done.

If you really really really want to try this anyway then the first thing you need to figure out is whether you want to match "longest common subsequence" or some variation of "longest common substring".

If you really are running this against arbitrary strings, then once you think you have a good algorithm, make sure to test it against a lot of sample data (which should be pretty easy to generate ...)

Zac Thompson
Alan Moore
@Alan: any criticism of the rest of my answer? Do you think the OP is 100% on the right track with this "generate-me-a-regex" approach? I've only used the quote in one place that I can recall (namely: on this page), and since I think it is utterly and precisely to the point in this particular case, I'll wait to give it a rest until I feel like I've been overusing it. Until then, here's a link to a page that describes my own thoughs on regular expressions very well: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Zac Thompson
I didn't mean *you* had overused it, just that it's been done to death already. Starting an answer with a worn-out zinger tends to make the author sound ignorant and prejudiced, which is unfortunate in this case, since this is actually a good answer.
Alan Moore
@Alan - good point. I'm going to leave it, since I really think it is appropriate in this case. Attention @Derek the asker: regular expressions are **awesome**, but they aren't always the best choice, is all I mean.
Zac Thompson