I have a Perl script that crunches a lot of data. There are a bunch of string variables that start small but grow really long due to the repeated use of the dot (concatenation) operator. Will growing the string in this manner result in repeated reallocations? If yes, is there a way to pre-allocate a string?
Alternate suggestion that will be much easier to cope with: push the strings onto an array and join it when you're done.
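A minimal sketch of that approach, assuming the pieces arrive one at a time from some processing loop. The @pieces name and the sample data are made up for illustration:

```perl
use strict;
use warnings;

my @pieces;
for my $record (qw(alpha beta gamma)) {
    # Stand-in for the real data crunching.
    push @pieces, uc($record) . ';';
}

# Build the final string once, at the end.
my $result = join '', @pieces;
print "$result\n";
```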
Yes, pre-extending strings that you know will grow is a good idea.
You can use the 'x' operator to do this. For example, to preallocate 1000 spaces:
$s = " " x 1000;
Perl's strings are mutable, so appending to a string does NOT incur a string duplication penalty.
You can try all you want to find a "faster" way, but this smells really bad of premature optimization.
As an example, I whipped up a class that abstracts away the hard work of preallocating. It works perfectly, but for all its goofy tricks it's really slow.
Here's the result:
          Rate  magic normal
magic   1.72/s     --   -93%
normal  23.9/s  1289%     --
Yes, that's right: plain Perl concatenation is nearly 1300% faster than what I thought was a respectable implementation.
Profile your code and find what the real problems are; don't try optimising stuff that isn't even a known problem.
#!/usr/bin/perl
use strict;
use warnings;
{
package MagicString;
use Moose;
has _buffer => (
isa => 'Str',
is => 'rw',
);
has _buffer_size => (
isa => 'Int',
is => 'rw',
default => 0,
);
has step_size => (
isa => 'Int',
is => 'rw',
default => 32768,
);
has _tail_pos => (
isa => 'Int',
is => 'rw',
default => 0,
);
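sub BUILD {
my $self = shift;
# Preallocate the first chunk and record its size.
$self->_buffer( chr(0) x $self->step_size );
$self->_buffer_size( $self->step_size );
}
sub value {
my $self = shift;
# Only the part of the buffer that has been written is the real string.
return substr( $self->{_buffer}, 0, $self->{_tail_pos} );
}
sub append {
my $self = shift;
my $value = shift;
my $L = length($value);
# Grow the buffer by one step when the new data would not fit.
if ( ( $self->{_tail_pos} + $L ) > $self->{_buffer_size} ) {
$self->{_buffer} .= ( chr(0) x $self->{step_size} );
$self->{_buffer_size} += $self->{step_size};
}
# Overwrite the padding in place with 4-argument substr.
substr( $self->{_buffer}, $self->{_tail_pos}, $L, $value );
$self->{_tail_pos} += $L;
}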
__PACKAGE__->meta->make_immutable;
}
use Benchmark qw( :all :hireswallclock );
cmpthese( -10 , {
magic => sub{
my $x = MagicString->new();
for ( 1 .. 200001 ){
$x->append( "hello");
}
my $y = $x->value();
},
normal =>sub{
my $x = '';
for ( 1 .. 200001 ){
$x .= 'hello';
}
my $y = $x;
}
});
#use Data::Dumper;
#print Dumper( length( $x->value() ));
Yes, Perl growing a string will result in repeated reallocations. Perl allocates a little bit of extra space to strings, but only a few bytes. You can see this using Devel::Peek. This reallocation is very fast and often does not actually copy the memory. Trust your memory manager, that's why you're programming in Perl and not C. Benchmark it first!
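To see this for yourself, here is a rough sketch that counts how often the allocated buffer size changes while appending. It reads CUR (bytes in use) and LEN (bytes allocated) through the core B module, then dumps a short string with Devel::Peek to STDERR. The buffer_stats helper is a name invented for this example, and the numbers will vary with your perl version and malloc.

```perl
use strict;
use warnings;
use B ();
use Devel::Peek ();

# Hypothetical helper: return (CUR, LEN) for the string behind a reference.
sub buffer_stats {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return ( $sv->CUR, $sv->LEN );
}

my $s = '';
my %sizes_seen;
for ( 1 .. 1000 ) {
    $s .= 'hello';
    my ( undef, $len ) = buffer_stats( \$s );
    $sizes_seen{$len} = 1;    # remember each allocated size we observe
}

printf "1000 appends, %d distinct buffer sizes\n", scalar keys %sizes_seen;

my $short = 'foo';
$short .= 'bar';
Devel::Peek::Dump($short);    # writes to STDERR; look for CUR and LEN
```

If the count of distinct buffer sizes is far below the number of appends, most appends did not need a reallocation at all.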
You can preallocate arrays with $#array = $num_entries and a hash with keys %hash = $num_keys, but length $string = $strlen doesn't work. Here's a clever trick I dug up on PerlMonks.
my $str = "";
vec($str, $length, 8)=0;
$str = "";
Or if you want to get into XS you can call SvGROW().
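Here is a small sketch of the vec trick with some instrumentation, in case you want to check what it does on your perl. The $length value is arbitrary, and reading LEN through the core B module is my addition for illustration. Whether the buffer is actually kept after the final assignment is an assumption you should verify from the printed LEN rather than take on faith.

```perl
use strict;
use warnings;
use B ();

my $length = 1_000_000;    # arbitrary size for the demonstration

my $str = '';
vec( $str, $length, 8 ) = 0;    # writing byte number $length extends the string
my $extended = length $str;     # $length + 1 bytes at this point

$str = '';                      # logical length goes back to zero

# LEN is the number of bytes currently allocated for the scalar's buffer.
my $allocated = B::svref_2object( \$str )->LEN;

print "length after vec: $extended\n";
print "length now: ", length($str), "\n";
print "allocated bytes (LEN): $allocated\n";
```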
chaos' suggestion to use an array and then join it all together will use more than double the memory. Memory for the array. Memory for each scalar allocated for each element in the array. Memory for the string held in each scalar element. Memory for the copy when joining. If it results in simpler code, do it, but don't think you're saving any memory.
I would go the array/join way: push(@array, $crunched_bit) and then $str = join('', @array), if nothing more, to have access to all the elements for debugging at some later time.
Growing a scalar via concatenation will result in memory allocation, though as Schwern points out, this may not happen with every concatenation and may not actually cause a performance problem.
See this PerlMonks discussion. You can use Convert::Scalar to preallocate memory for a scalar.
#!/path/to/perl
use strict;
use warnings;
use Convert::Scalar qw(grow);
use Devel::Size qw(size);
my $string = 'foo';
# contains 'foo', length 3, size 28
print "string contains '$string'", "\n";
print "length of string is ", length($string), "\n";
print "size of scalar in bytes is ", size($string), "\n\n";
grow($string, 1000000);
# contains 'foo', length 3, size 1000024
print "string contains '$string'", "\n";
print "length of string is ", length($string), "\n";
print "size of scalar in bytes is ", size($string), "\n\n";
# this should not allocate any more memory
while (length($string) < 1000000) {
$string .= 'bar';
}
# length 1000002, size 1000024
print "length of string is ", length($string), "\n";
print "size of scalar in bytes is ", size($string), "\n";
I don't know specifically how Perl strings are implemented, but a pretty good guess is that appending takes amortized constant time. This means that even if you do find a way to pre-allocate your string, chances are that the combined time it saves for all the script's users will be less than the time you spent asking this question on Stack Overflow.
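To make the amortized argument concrete, here is a toy model (not Perl's actual allocator) that counts how many bytes would be copied if every reallocation had to move the whole string. The bytes_copied function and the two growth policies are invented for this illustration.

```perl
use strict;
use warnings;

# Simulate $appends appends of $piece bytes each. $grow is a policy that
# takes the number of bytes needed and returns the new capacity.
sub bytes_copied {
    my ( $appends, $piece, $grow ) = @_;
    my ( $used, $cap, $copied ) = ( 0, 0, 0 );
    for ( 1 .. $appends ) {
        my $need = $used + $piece;
        if ( $need > $cap ) {
            $copied += $used;          # worst case: old contents are moved
            $cap = $grow->($need);
        }
        $used = $need;
    }
    return $copied;
}

# Exact-fit growth: allocate only what is needed right now.
my $exact = bytes_copied( 10_000, 5, sub { $_[0] } );

# Geometric growth: allocate twice what is needed right now.
my $geometric = bytes_copied( 10_000, 5, sub { 2 * $_[0] } );

print "exact-fit growth copied $exact bytes\n";
print "geometric growth copied $geometric bytes\n";
```

With exact-fit growth the bytes copied grow quadratically with the number of appends. With geometric growth they stay proportional to the final length, which is what amortized constant time per append means.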