I came across this Perl construct today:

@foo = split("\n", $bar);

That works well for splitting a large string into an array of lines for UNIX-type line endings, but leaves a trailing \r for Windows. So I changed it to:

@foo = split("\r?\n", $bar);

Which splits the string by lines and doesn't leave a trailing \r (tested under ActivePerl 5.8). Then it was pointed out to me that this should probably be:

@foo = split(/\r?\n/, $bar);

So why does the second variant work at all? The double quotes mean that the contents are evaluated, which is why the \r and \n are actually treated as CR and LF, but the ? is treated as a regex metacharacter rather than a literal question mark.

Are the slashes around the regular expression just optional for split()? Is it just assumed that the first parameter to the function will be a regex?

+6  A: 

The slashes are just the standard delimiters for a regular expression (you can use others), and they evaluate special characters and escape sequences just like double quotes.

EDIT: I shot too fast, as Manni explained in the comment. I'll try a longer explanation:

Usually, matching regexes in Perl start with m, and the regex body is then enclosed in some delimiter. The standard delimiter for matching regexes is the slash, and you can omit the leading m if you use slashes as delimiter:

m/\r?\n/
m"\r?\n"
m$\r?\n$
/\r?\n/

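These all do the same thing, and they are called "regex literals". If you use single quotes as the delimiter (m'...'), variables are not interpolated into the pattern, although the regex engine still interprets escapes such as \n itself.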

At this point, it seems strange that your second attempt, with the regex in double quotes but without the leading m, worked at all. But, as Arnshea explained, split accepts its pattern not only as a regex literal but also as a string.
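As a rough illustration of the delimiter rules described above, the following sketch splits the same string with several spellings of the pattern. The sample data and variable names are made up for this example; the only assumption is a reasonably recent Perl 5.

```perl
use strict;
use warnings;

# Sample input mixing Windows (CRLF) and UNIX (LF) line endings
my $text = "one\r\ntwo\nthree";

# The same pattern with different delimiters. Only the slash form
# may drop the leading m; every other delimiter needs it.
my @slash  = split /\r?\n/,  $text;
my @m      = split m/\r?\n/, $text;
my @quotes = split m"\r?\n", $text;
my @braces = split m{\r?\n}, $text;

# Each line printed should be: one|two|three
print join('|', @$_), "\n" for \@slash, \@m, \@quotes, \@braces;
```

All four calls should produce the same three fields, with no trailing \r left on "one".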

Svante
That's not quite right. If you use other delimiters, you will have to explicitly specify your operation. split m$\n$,@_ will work, but split $\n$, @_ will not. The correct answer is Arnshea's.
innaM
Thanks for clearing this up.
innaM
+7  A: 

You can pass split a regex as a string or a regex literal. So passing it as a double-quoted string is fine.

You can also delimit regular expression literals with characters other than the standard /regex/, as long as you write the leading m explicitly (for example m{regex}).
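A small sketch of that point, using invented sample data: the pattern is passed once as a double-quoted string, once as a regex literal, and once as a plain string held in a variable. In the string forms, Perl first turns \r and \n into CR and LF characters, and split then compiles the resulting string as a regex, so the ? still acts as a quantifier.

```perl
use strict;
use warnings;

# Hypothetical sample input with mixed line endings
my $bar = "first\r\nsecond\r\nthird\n";

my @from_string  = split("\r?\n", $bar);   # double-quoted string
my @from_literal = split(/\r?\n/, $bar);   # regex literal

my $pattern = "\r?\n";                     # ordinary string variable
my @from_variable = split($pattern, $bar);

# Each line printed should be: 3: first,second,third
for my $result (\@from_string, \@from_literal, \@from_variable) {
    print scalar(@$result), ": ", join(',', @$result), "\n";
}
```

The trailing newline does not produce a fourth, empty element, because split drops trailing empty fields by default.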

Arnshea
+4  A: 

Yes, split always takes a regex (except for the special case of a string containing a single space). If you give it a string, that string will be used as a regex. The same thing happens with =~ (e.g. $foo =~ "pattern"). Regex metacharacters are treated as such whether or not you use //.

That is why it is a good idea to always use //: it emphasizes that the first argument is always a regex, never a literal string, so you don't accidentally try split("|", "a|b|c") someday.
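To make the pitfall concrete, here is a short sketch with invented sample data. It shows what happens when "|" is handed to split as a string, and how the single-space special case differs from a regex that matches one space.

```perl
use strict;
use warnings;

my $record = "a|b|c";

# "|" is compiled as a regex: an alternation of two empty patterns.
# It matches the empty string, so the input is split between characters.
my @wrong = split("|", $record);

# Escaping the pipe splits on a literal pipe character
my @right = split(/\|/, $record);

print join(',', @wrong), "\n";   # a,|,b,|,c
print join(',', @right), "\n";   # a,b,c

# The one exception: a string holding a single space skips leading
# whitespace and splits on runs of whitespace.
my $padded = "  a  b ";
my @awk_mode = split(' ', $padded);   # 2 fields: a, b
my @regex    = split(/ /, $padded);   # 5 fields, some of them empty

print scalar(@awk_mode), " vs ", scalar(@regex), "\n";   # 2 vs 5
```

Writing the pattern as /\|/ keeps the metacharacter visible, which is the habit recommended above.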

ysth
+1  A: 

Let's look at benchmarks of several alternatives.

use Modern::Perl;
use Benchmark qw'cmpthese';

# set up some test data
my $bar = join "\n", 'a'..'z';

my $qr  = qr/\r?\n/;
my $str =   "\r?\n";
my $qq  = qq/\r?\n/;

my %test = (
  '   //' =>   sub{ split(   /\r?\n/, $bar ); },
  '  m//' =>   sub{ split(  m/\r?\n/, $bar ); },
  '  m""' =>   sub{ split(  m"\r?\n", $bar ); },
  ' qr//' =>   sub{ split( qr/\r?\n/, $bar ); },
  ' qq//' =>   sub{ split( qq/\r?\n/, $bar ); },
  '   ""' =>   sub{ split(   "\r?\n", $bar ); },
  '$qr  ' =>   sub{ split( $qr,  $bar ); },
  '$str ' =>   sub{ split( $str, $bar ); },
  '$qq  ' =>   sub{ split( $qq,  $bar ); }
);

cmpthese( -5, \%test, 'auto');

Benchmark: running
    "",    //,   m"",   m//,  qq//,  qr//, $qq  , $qr  , $str  
    for at least 5 CPU seconds...

      "":  6 wallclock secs ( 5.21 usr +  0.02 sys =  5.23 CPU) @ 42325.81/s (n=221364)
      //:  6 wallclock secs ( 5.26 usr +  0.00 sys =  5.26 CPU) @ 42626.24/s (n=224214)
     m"":  6 wallclock secs ( 5.30 usr +  0.01 sys =  5.31 CPU) @ 42519.96/s (n=225781)
     m//:  6 wallclock secs ( 5.20 usr +  0.00 sys =  5.20 CPU) @ 42568.08/s (n=221354)
    qq//:  6 wallclock secs ( 5.24 usr +  0.01 sys =  5.25 CPU) @ 42707.43/s (n=224214)
    qr//:  6 wallclock secs ( 5.11 usr +  0.03 sys =  5.14 CPU) @ 33277.04/s (n=171044)
   $qq  :  5 wallclock secs ( 5.15 usr +  0.00 sys =  5.15 CPU) @ 42154.76/s (n=217097)
   $qr  :  4 wallclock secs ( 5.28 usr +  0.00 sys =  5.28 CPU) @ 39593.94/s (n=209056)
   $str :  6 wallclock secs ( 5.29 usr +  0.00 sys =  5.29 CPU) @ 41843.86/s (n=221354)


         Rate  qr//   $qr  $str   $qq    ""   m""   m//    //  qq//
 qr// 33277/s    --  -16%  -20%  -21%  -21%  -22%  -22%  -22%  -22%
$qr   39594/s   19%    --   -5%   -6%   -6%   -7%   -7%   -7%   -7%
$str  41844/s   26%    6%    --   -1%   -1%   -2%   -2%   -2%   -2%
$qq   42155/s   27%    6%    1%    --   -0%   -1%   -1%   -1%   -1%
   "" 42326/s   27%    7%    1%    0%    --   -0%   -1%   -1%   -1%
  m"" 42520/s   28%    7%    2%    1%    0%    --   -0%   -0%   -0%
  m// 42568/s   28%    8%    2%    1%    1%    0%    --   -0%   -0%
   // 42626/s   28%    8%    2%    1%    1%    0%    0%    --   -0%
 qq// 42707/s   28%    8%    2%    1%    1%    0%    0%    0%    --

It's worth noting that these are all essentially the same speed, with qr// coming out somewhat slower. After running this test several times, qr// and $qr were always the slowest and second slowest of the bunch, with the others exchanging places regularly.

So basically it doesn't matter how you set up the regex for split().

Brad Gilbert
I'd be interested in knowing what you think of [nextgen](http://search.cpan.org/~ecarroll/nextgen/), it is like a more modern Modern::Perl.
Evan Carroll