tags:

views:

89

answers:

5

I would like to optimise this Perl sub:

push_csv($string,$addthis,$position);

for placing strings inside a CSV string.

e.g. if $string="one,two,,four"; $addthis="three"; $position=2;
then push_csv($string,$addthis,$position) will change the value of $string = "one,two,three,four";

sub push_csv {

    my @fields = split /,/, $_[0]; # split original string by commas;
    $_[1] =~ s/,//g;               # remove commas in $addthis
    $fields[$_[2]] = $_[1];        # put the $addthis string into
                                   # the array position $position.
    $_[0] = join ",", @fields;     # join the array with commas back
                                   # into the string.
}

This is a bottleneck in my code, as it needs to be called a few million times.

If you are proficient in Perl, could you take a look at it, and suggest optimisation/alternatives? Thanks in advance! :)


EDIT: Converting to @fields and back to string is taking time, I just thought of a way to speed it up where I have more than one sub call in a row. Split once, then push more than one thing into the array, then join once at the end.

A: 

Don't you think it might be easier to use arrays and splice, and only use join to create the comma separation at the end?

I really don't think using s/// repeatedly is a good idea if this is a major bottleneck in your code.

Platinum Azure
Thanks, I will read up on splice soon.
Dave
+3  A: 

For several reasons, you should be using Text::CSV to handle these low-level CSV details. Provided that you are able to install the XS version, my understanding is that it will run faster than anything you can do in pure Perl. In addition, the module will correctly handle all sorts of edge cases that you are likely to miss.

use Text::CSV;
my $csv = Text::CSV->new;

my $line = 'foo,,fubb';
$csv->parse($line);

my @fields = $csv->fields;
$fields[1] = 'bar';

$csv->combine(@fields); 
print $csv->string;      # foo,bar,fubb
FM
But it will not build for him a string, no? If his CSV doesn't have edge cases (which is probably true if his function is already working) the problem is not parsing the CSV, but constructing a new CSV string speedily
Vinko Vrsalovic
@Vinko Vrsalovic The module parses and builds CSV strings.
FM
Then you get my vote! :) (and for the sample too)
Vinko Vrsalovic
Thanks! I installed Text::CSV_XS, and ran my code again after implementing your code sample, and after initial testing the code is about the same speed, and will run more tests tomorrow (its getting late here). Anyway, thanks, I didnt know about XS before and assumed Text::CSV was slow. But you are saying it should be much faster.
Dave
This should really be in a FAQ somewhere: use Text::CSV
Axeman
Axeman, http://perldoc.perl.org/perlfaq4.html#How-can-I-split-a-%5bcharacter%5d-delimited-string-except-when-inside-%5bcharacter%5d%3f
daxim
+2  A: 

Keep your array as an array in the first place, not as a ,-separated string?

You might want to have a look at Data::Locations.

Or try (untested, unbenchmarked, doesn't append new fields like your original can...)

sub push_csv {
    $_[1] =~ y/,//d;
    $_[0] =~ s/^(?:[^,]*,){$_[2]}\K[^,]*/$_[1]/;
    return;
}
ysth
+1 for an s!!! approach, and a "goodwill" +1 for \K
pilcrow
+1  A: 

A few suggestions:

  • Use tr/,//d instead of s/,//g as it is faster. This is essentially the same as ysth's suggestion to use y/,//d
  • Perform split only as much as is needed. If $position = 1, and you have 10 fields, then you're wasting computation performing unnecessary splits and joins. The optional third argument to split can be leveraged to your advantage here. However, this does depend on how many consecutive empty fields you are expecting. It may not be worth it if you don't know ahead of time how many of these you have
  • You're quite right in wanting to perform multiple appends with one sub-call. There is no need to perform multiple splits and joins when one will do just as well

You really ought to be using Text::CSV, but here's how I would revise the implementation of your sub in pure Perl (assuming a maximum of one consecutive empty field):

sub push_csv {

    my ( $items, $positions ) = @_[1..2];

    # Test inputs
    warn "No. of items to add & positions not equal"
      and
        return unless @{$items} == @{$positions};

    my $maxPos;  # Find the maximum position number

    for my $position ( @{$positions} ) {

        $maxPos ||= $position;
        $maxPos = $position if $maxPos < $position;
    }

    my @fields = split /,/ , $_[0], $maxPos+2;  # Split only as much as needed

    splice ( @fields, $positions->[$_], 1, $items->[$_] ) for 0 .. $#{$items};
    $_[0] = join ',' , @fields;
    print $_[0],"\n";
}

Usage

use strict;
use warnings;

my $csvString = 'one,two,,four,,six';
my @missing = ( 'three', 'five' );
my @positions = ( 2, 4 );

push_csv ( $csvString, \@missing, \@positions );
print $csvString;   # Prints 'one,two,three,four,five,six'
Zaid
+1  A: 

If you're hitting a bottleneck by splitting and joining a few million times... then don't split and join constantly. split each line once when it initially enters the system, pass that array (or, more likely, a reference to the array) around while doing your processing, and then do a single join to turn it into a string when you're ready for it to leave the system.

e.g.:

#!/usr/bin/env perl

use strict;
use warnings;

# Start with some CSV-ish data
my $initial_data = 'foo,bar,baz';

# Split it into an arrayref
my $data = [ split /,/, $initial_data ];

for (1 .. 1_000_000) {
  # Pointless call to push_csv, just to make it run
  push_csv($data, $_, $_ % 3);
}

# Turn it back into a string and display it
my $final_data = join ',', @$data;
print "Result: $final_data\n";

sub push_csv {
  my ($data_ref, $value, $position) = @_;
  $$data_ref[$position] = $value;
  # Alternately:
  # $data_ref->[$position] = $value;
}

Note that this simplifies things enough that push_csv becomes a single, rather simple, line of processing, so you may want to just do the change inline instead of calling a sub for it, especially if runtime efficiency is a key criterion - in this trivial example, getting rid of push_csv and doing it inline reduced run time by about 70% (from 0.515s to 0.167s).

Dave Sherohman