I have been wondering whether there is a nicer, yet still concise, way to split a string into its characters.

@characters = split //, $string

is not that hard to read, but somehow the use of a regular expression looks like overkill to me.

I have come up with this:

@characters = map { substr $string, $_, 1 } 0 .. length($string) - 1

but I find it uglier and less readable. What is your preferred way of splitting that string into its characters?

+4  A: 

You're right. The standard way to do it is split //, $string. To make code more readable you can create a simple function:

sub get_characters {
    my ($string) = @_;
    return ( split //, $string );
}

@characters = get_characters($string);
Ivan Nevostruev
... and add comments inside the sub to describe the implementation.
toolic
+5  A: 

It doesn't get much clearer than using the split function to split a string. I suppose you could argue that the null pattern is unintuitive, though I find it clear enough. If you want a "clean" alternative, wrap it in a sub:

my @characters = chars($string);
sub chars { split //, $_[0] }
Michael Carman
You should shift it off too, right?
Mark Canlas
You can if you think it's clearer but for such a small and simple function I generally don't bother. Note that the function doesn't change the value of `$_[0]`. If I were modifying the value I'd make a copy to avoid unexpected side effects for the caller.
Michael Carman
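To make the point about `$_[0]` concrete, here is a minimal sketch. The elements of `@_` are aliases for the caller's arguments, so reading `$_[0]` is harmless, while assigning to it changes the caller's variable. The helpers `upcase_in_place` and `upcase_copy` are hypothetical names used only for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Reads $_[0] without copying; split leaves its input untouched.
sub chars { split //, $_[0] }

# Hypothetical helper: writes through the @_ alias, so the caller's variable changes.
sub upcase_in_place { $_[0] = uc $_[0]; return }

# Hypothetical helper: copies the argument first, so the caller's variable is safe.
sub upcase_copy { my ($copy) = @_; return uc $copy }

my $string     = 'abc';
my @characters = chars($string);     # $string is unchanged by chars()

my $upper      = upcase_copy($string);
my $after_copy = $string;            # still the original value

upcase_in_place($string);            # $string itself is modified here

print "@characters\n";
print "$after_copy $string\n";
```

Whether to shift or copy the argument is mostly a readability choice as long as the sub only reads it.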
+2  A: 

Use split with a null pattern to break up the string into individual characters:

@characters = split //, $string;

If you just want the char codes, use unpack:

@values = unpack("C*", $string);

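Note that the C template only handles octet values (0 to 255); for strings containing wider characters, the W template (available since Perl 5.10) returns the full code points. The use utf8 pragma only affects how string literals in the source file are decoded, so it does not change what unpack does. You can also combine unpack with chr to split the string into individual characters, since TMTOWTDI: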

@characters = map chr, unpack("C*", $string);
eugene y
If your motto is "of every possible way to do it, pick the most unreadable one" this is a nice candidate. I'm not so bad at picking up new idioms, but pack/unpack somehow escape my grip.
reinierpost
Question is, is this faster than a split?
DVK
`@characters = unpack '(a)*', $string;` seems to work, too. Let's see what else we can dig up. :-)
hillu
@hillu: If you want more obfuscated ways to do it... http://www.perlmonks.org/?node_id=54413
toolic
I wasn't really looking for golfed / unreadable ways to do it, but almost all the answers seem to be headed in that general direction, so what the heck..
hillu
pack and unpack may be a little cryptic, but they are very fast. For something pack or unpack can do directly, usually the only faster way to do it is in C.
Eric Strom
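For strings that are not plain ASCII, a variation on the unpack approach may be worth trying. The sketch below assumes Perl 5.10 or later and uses the W template instead of C, because C is limited to octet values. The sample string is made up for illustration, and the result is cross-checked against split rather than taken on faith.

```perl
use strict;
use warnings;

# Sample data (an assumption): one character in the 128..255 range and one above 255.
my $string = "na\x{EF}ve \x{263A}";

# W* yields one code point per character, including values above 255.
my @values = unpack 'W*', $string;

# Rebuild the individual characters from the code points.
my @characters = map { chr } @values;

# Cross-check against the split-based approach.
my @expected = map { ord } split //, $string;

print "unpack: @values\n";
print "split:  @expected\n";
```

Only the code points are printed here, which avoids "Wide character" warnings on an output handle without an encoding layer.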
+5  A: 

For less readable and more concise (and still with regex overkill):

@characters = $string =~ /./g;

(I learned this idiom from playing code-golf.)

mobrule
Uh. This is disgusting in an exciting way. +1 :-)
hillu
+4  A: 

I prefer using the split technique. It is well-known, and it is documented.

Yet another way...

@characters = $string =~ /./gs;
toolic
+1 (See my comment to mobrule's post)
hillu
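The two match-based variants in this thread are not quite equivalent: without the /s modifier, `.` does not match a newline, so `/./g` silently drops newline characters. A minimal sketch of the difference, using a made-up two-line string:

```perl
use strict;
use warnings;

my $string = "a\nb";

my @without_s = $string =~ /./g;     # '.' skips the newline
my @with_s    = $string =~ /./gs;    # /s lets '.' match "\n" as well
my @via_split = split //, $string;   # split // keeps every character

# Element counts for each approach, in the order above.
printf "%d %d %d\n", scalar @without_s, scalar @with_s, scalar @via_split;   # prints "2 3 3"
```

So if the input can contain newlines, the /s form or split // is the safer choice.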
+4  A: 

Why would using a regular expression be "overkill"? Many worry that regexes in Perl are overkill because they think that running them involves a highly complex and slow regex algorithm. That's not always true: the implementation is highly optimized and many simple cases are treated specially: what looks like a regex may actually perform as well as a simple substring search. I wouldn't be surprised at all if this type of split is optimized as well. split is faster than your map in some tests I ran. unpack appears to be slightly faster than split.

I recommend split because it is the "idiomatic" way. You'll find it in perldoc, in many books, and any good Perl programmer should know it (if you are not sure your audience will understand it, you can always add a comment to the code like someone suggested.)

OTOH, if regexes are "overkill" only because the syntax is ugly, then it's too subjective for me to say anything. ;-)

itub
Great answer. On overkill: I did not consider the run time at all. Regular expressions are well integrated into Perl, but when reading and trying to understand code, they often still require shifting one's mind. That isn't much of a problem with the "idiomatic" expression using `split` and the empty match.
hillu
+8  A: 

Various examples, and speed comparisons.

I thought it might be a good idea to see how fast some of the ways are to split a string on every character.

I ran the test against several versions of Perl that I happen to have on my computer.

test.pl

use 5.010;
use Benchmark qw(:all) ;
my %bench = (
   'split' => sub{
     state $string = 'x' x 1000;
     my @chars = split //, $string;
     \@chars;
   },
   'split-string' => sub{
     state $string = 'x' x 1000;
     my @chars = split '', $string;
     \@chars;
   },
   'split-capture' => sub{
     state $string = 'x' x 1000;
     my @chars = split /(.)/, $string;
     \@chars;
   },
   'unpack' => sub{
     state $string = 'x' x 1000;
     my @chars = unpack( '(a)*', $string );
     \@chars;
   },
   'match' => sub{
     state $string = 'x' x 1000;
     my @chars = $string =~ /./gs;
     \@chars;
   },
   'match-capture' => sub{
     state $string = 'x' x 1000;
     my @chars = $string =~ /(.)/gs;
     \@chars;
   },
   'map-substr' => sub{
     state $string = 'x' x 1000;
     my @chars = map { substr $string, $_, 1 } 0 .. length($string) - 1;
     \@chars;
   },
);
# set the initial state of $string
$_->() for values %bench;
cmpthese( -10, \%bench );

# Shell loop that runs test.pl under each installed perl
for perl in /usr/bin/perl /opt/perl-5.10.1/bin/perl /opt/perl-5.11.2/bin/perl;
do
  $perl -v | perl -nlE'if( /(v5\.\d+\.\d+)/ ){
    say "## Perl $1";
    say "<pre>";
    last;
  }';
  $perl test.pl;
  echo -e '</pre>\n';
done

Perl v5.10.0

               Rate split-capture match-capture map-substr match unpack split split-string
split-capture 296/s            --          -20%       -20%  -23%   -58%  -63%         -63%
match-capture 368/s           24%            --        -0%   -4%   -48%  -54%         -54%
map-substr    370/s           25%            0%         --   -3%   -48%  -53%         -54%
match         382/s           29%            4%         3%    --   -46%  -52%         -52%
unpack        709/s          140%           93%        92%   86%     --  -11%         -11%
split         793/s          168%          115%       114%  107%    12%    --          -0%
split-string  795/s          169%          116%       115%  108%    12%    0%           --

Perl v5.10.1

               Rate split-capture map-substr match-capture match unpack split split-string
split-capture 301/s            --       -31%          -41%  -47%   -60%  -65%         -66%
map-substr    435/s           45%         --          -14%  -23%   -42%  -50%         -50%
match-capture 506/s           68%        16%            --  -10%   -32%  -42%         -42%
match         565/s           88%        30%           12%    --   -24%  -35%         -35%
unpack        743/s          147%        71%           47%   32%     --  -15%         -15%
split         869/s          189%       100%           72%   54%    17%    --          -1%
split-string  875/s          191%       101%           73%   55%    18%    1%           --

Perl v5.11.2

               Rate split-capture match-capture match map-substr unpack split-string split
split-capture 300/s            --          -28%  -32%       -38%   -59%         -63%  -63%
match-capture 420/s           40%            --   -5%       -13%   -42%         -48%  -49%
match         441/s           47%            5%    --        -9%   -39%         -46%  -46%
map-substr    482/s           60%           15%    9%         --   -34%         -41%  -41%
unpack        727/s          142%           73%   65%        51%     --         -10%  -11%
split-string  811/s          170%           93%   84%        68%    12%           --   -1%
split         816/s          171%           94%   85%        69%    12%           1%    --

As you can see, split is the quickest, because the code for split treats this pattern as a special case.

split-capture is the slowest, probably because it has to set $1, along with several other match variables.

So I would recommend going with plain old split //, ..., or the roughly equivalent split '', ....

Brad Gilbert
+1 good comparison. I'm surprised that `unpack '(a)*'` isn't faster. It would be good to see this with Unicode strings as well.
Eric Strom
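Following up on the Unicode suggestion, here is a rough sketch of how the benchmark could be adapted. It is not the original test.pl: the test string, the reduced set of variants, and the one-second run time are all assumptions chosen to keep the example short. It first checks that the variants agree on a non-ASCII string, then prints the comparison table. No timings are quoted here, since they will differ by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: 1000 characters, half of them above U+00FF.
my $string = "x\x{263A}" x 500;

my %bench = (
    'split'  => sub { my @chars = split //, $string;       \@chars },
    'unpack' => sub { my @chars = unpack '(a)*', $string;  \@chars },
    'match'  => sub { my @chars = $string =~ /./gs;        \@chars },
);

# Sanity check: every variant should return the same list of characters.
my $expected = join ',', map { ord } @{ $bench{'split'}->() };
for my $name ( sort keys %bench ) {
    my $got = join ',', map { ord } @{ $bench{$name}->() };
    die "$name disagrees with split\n" unless $got eq $expected;
}

# Run each variant for at least 1 CPU second and print the comparison table.
cmpthese( -1, \%bench );
```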