tags:

views:

167

answers:

3

Your task, should you choose to accept it, is to write a Perl regular expression that for a given string, will return the first occurence of a character that is not consecutively duplicated. In other words, both preceded AND succeeded by characters different from itself (or start/end of string respectively).

Example:

IN: aabbcdecc
OUT: c

Please note that "not consecutively duplicated" does not mean "anywhere in the string".

NOTE: it must be a pure regex expression. E.g. the solution that obviously comes to mind (clone the string, delete all the duplicates, and print the first remaining character) does not count, although it solves the problem.

The question is inspired by my somewhat off-topic answer to this: http://stackoverflow.com/questions/2548606/perl-function-to-find-first-non-repeating-character-in-a-string

+1  A: 
use 5.010;
$str=~/^(([a-z])\g{-1}+)*(?<c>[a-z])/i;
$char = $+{c};
Anon
Wrong. `[a-z]{2,}` will match `abc`.
KennyTM
o crap good point
Anon
try now I fixed it
Anon
Error: Sequence (?<c...) not recognized in regex;
DVK
@DVK: that's the 5.10 labeled capture feature, as is the \g relative back reference.
brian d foy
@brian - illuminating as usual. I paid some attention to 5.10 out if curiocity but at work I'm still suck in late Jurassic with 5.8 and smatterings of 5.005 *gag* :(
DVK
+2  A: 
(?:(.)\1+)*(.?)

Get the 2nd capture. (Will return an empty string if every character is consecutively duplicated.)

Test cases:

~:2434$ perl -e "\"abc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
a
~:2435$ perl -e "\"aabbcc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"

~:2436$ perl -e "\"aabbc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
c
~:2437$ perl -e "\"aabcc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
b
~:2438$ perl -e "\"aabcbbbcccccc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
b
~:2439$ perl -e "\"aabbvbbcccccc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
v
~:2440$ perl -e "\"aabbcdecc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
c
~:2441$ perl -e "\"aabbccddeef\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
f
~:2442$ perl -e "\"faabbccddeef\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
f
~:2443$ perl -e "\"faabbccddeefax\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
f
~:2444$ perl -e "\"xfaabbccddeefx\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
x
~:2445$ perl -e "\"xabcdefghai\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
x
~:2446$ perl -e "\"cccdddeeea12345\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
a
~:2447$ perl -e "\"1234a5678a23\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"
1

Or (will not match if every character is consecutively duplicated.)

(?:^|(.)(?!\1))(.)(?!\2)
KennyTM
Does not work ;(. the captures were "a" and "b". I will leave it to you to figure out why :)
DVK
Whoever up-voted this, please take it back - it does not work!
DVK
@Kenny - second try's much better. You caught the error quick. I got this far on my own, but I'm also stuck on "what if no duplicates upfront".
DVK
@DVK: `perl -e "\"aabbcdecc\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"` gives me `c`; @spong `perl -e "\"abcdefg\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"` gives me `a`. I don't know why it's wrong with you guys.
KennyTM
@kenny - sorry, I was testing your original expression (.)
DVK
@DVK: Ah I see.
KennyTM
After the edit, the first expression works!
DVK
Curiously, your first regex matches any string because you use the * (zero or more) and ? (zero or one) can always match zero cases of anything.
brian d foy
@brian: Justify your claim. They clearly work. `perl -e "\"aabbccddeef\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"` prints `f`, `perl -e "\"cccdddeeea12345\" =~ m/(?:(.)\1+)*(.?)/; print \$2;"` prints `a`.
KennyTM
@brian: And the second regex is able to match 123 is *by-design*. The OP wants to match **not consecutively duplicated** characters, not **duplicated** characters.
KennyTM
I misunderstood the problem. My apologies.
brian d foy
A: 

I wish Perl had a regex negate flag! ie, return all the characters that do NOT match /regex/

What you are looking for is really the regex capture complement of:

m/(.)(\1)+/

I tried all the suggestions on this page against Brian's data list (the result of in his program listing). None work completely.

The regex of:

(?:^|(.)(?!\1))(.)(?!\2) 

fails to match the beginning 'f' in line 2 and 3. Brian's does not match the 'f' at the beginning of line 2 and 3 or any of the singletons at the end of line 5.

The regex of:

$str=~/^(([a-z])\g{-1}+)*(?<c>[a-z])/i;
$char = $+{c};

does work.

The only single regex that I found is a simple one:

#!/usr/bin/perl
while( <DATA> ) {
    chomp;
    print "BEFORE: $_\n";
    s/(.)(\1)+//g;
    print "AFTER: $_\n";
    print "charater: " . substr($_,0,1) . "\n\n";
 }
__END__
aabbccddeef
faabbccddeef
faabbccddeefax
xfaabbccddeefx
xabcdefghai
cccdddeeea12345
1234a5678a23
aabbcdecc
abcdefg
aabbccddeef
cccdddeeea12345

This works in the simple case of 'give the first character.' ((edit: reread: sorry, I now read that the obvious delete the doubles was not what you were looking for...))

Love to hear if there is a better solution.

drewk
This one seems to work: m/(?:(.)\1+)*(.?)/
DVK
This: m/(?:(.)\1+)*(.?)/ indeed works! Even the singletons on the beginning of lines and in groups are in \2.Nice puzzle...
drewk