I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?

+9  A: 

You can use the \X special escape (it matches a non-combining character and all of the combining characters that follow it) with split to make a list of graphemes (with empty strings between them), reverse the list of graphemes, then join them back together:

#!/usr/bin/perl

use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";
Chas. Owens
For those confused (as I was at first) about why there are empty strings between the graphemes, it's because the `split` is inverted: it uses the data that's wanted as the separator. The empty string is what's "between" two graphemes. It's only by including the separator in the result that you get the graphemes mixed in with the "real" result -- a bunch of empty strings. An alternative (and slightly faster) method that avoids that is to use an `m//g` to capture the graphemes instead: `join '', reverse $original =~ /(\X)/g`
Michael Carman
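The `m//g` alternative from the comment above can be sketched as a small script. The helper name `reverse_graphemes` is made up for illustration, and the `binmode` line is only there so printing the accented characters does not warn. Because the global match consumes each grapheme whole, no empty strings appear in the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode output as UTF-8 so printing wide characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: reverse a string one grapheme at a time.
sub reverse_graphemes {
    my ($string) = @_;
    # In list context, /(\X)/g returns every grapheme in order.
    my @graphemes = $string =~ /(\X)/g;
    return join '', reverse @graphemes;
}

my $original = "re\x{0301}sume\x{0301}";
print "original: $original\n";
print "reversed: ", reverse_graphemes($original), "\n";
```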
To clarify Michael's comment, when you use capturing parentheses in a regex you give to split, you trigger "separator retention mode". You get back the thing that goes between the parts you are splitting up. You don't need to do that, however. The pattern (?=\X) does the same thing with no extra bits. Not that the empty string really matters that much for small strings.
brian d foy
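To see separator retention mode at work, here is a rough sketch that counts what split hands back when the pattern has capturing parentheses. The variable names are only illustrative. Filtering with `grep { length }` is one way to drop the empty fields if they get in the way.

```perl
use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";

# Capturing parentheses: split returns the separators (the graphemes)
# interleaved with the empty strings found between them.
my @with_empties = split /(\X)/, $original;

# Dropping the empty fields leaves just the graphemes.
my @graphemes = grep { length } @with_empties;

printf "fields from split: %d\n", scalar @with_empties;
printf "graphemes:         %d\n", scalar @graphemes;
```

Joining either list with an empty string gives back the original string, which is why the extra empty fields are harmless for the reversal.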
+5  A: 

I modified Chas's example a bit:

  • Set the encoding on STDOUT to avoid "wide character in print" warnings;
  • Use a positive lookahead assertion (and no separator retention mode) in split

It's basically the same thing with a couple of tweaks.

use strict;
use warnings;

binmode STDOUT, ":utf8";

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(?=\X)/, $original;

print <<HERE;
original: [$original]
   wrong: [$wrong]
   right: [$right]
HERE
brian d foy
Wow. I like perl, but that split expression is pretty magical. My first thought was "brute force": make a function to do what the split does -- return a list of strings, each entry of which represents a logical character. However you get that list (call it @x), the join( '', reverse( @x) ) part obviously follows, fortunately.
Roboprog
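For comparison, the "brute force" idea above might look something like the sketch below. The function name `logical_chars` is invented here, and it assumes a logical character is just a base character plus any combining marks (`\p{M}`) that follow it. That is a simplification; \X covers more cases than this loop does.

```perl
use strict;
use warnings;

# Encode output as UTF-8 so printing wide characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: build the list of logical characters by hand.
# Assumption: a logical character is a base character followed by
# zero or more combining marks (\p{M}).
sub logical_chars {
    my ($string) = @_;
    my @chars;
    for my $char (split //, $string) {
        if (@chars && $char =~ /\p{M}/) {
            $chars[-1] .= $char;    # attach the mark to the preceding character
        }
        else {
            push @chars, $char;     # start a new logical character
        }
    }
    return @chars;
}

my @x        = logical_chars("re\x{0301}sume\x{0301}");
my $reversed = join '', reverse @x;
print "reversed: $reversed\n";
```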
Magical? How so? It's just a regex with no side effects and it only does exactly what you see. If you think that's magic, you haven't seen the real black arts of Perl. You might call it clever (although I wouldn't), but it's not magical. It's probably just something you haven't ever used.
brian d foy