
We've got a mature body of code that loads data from files into a database. There are several file formats, all of which use fixed-width fields.

Part of the code uses the Perl unpack() function to read fields from the input data into package variables. Business logic is then able to refer to these fields in a 'human-readable' way.

The file reading code is generated from a format description once, prior to reading a file.

In sketch form, the generated code looks like this:

while ( <> ) {

    # Start of generated code.

    # Here we unpack 2 fields; the real code does around 200.
    ( $FIELDS::transaction_date, $FIELDS::customer_id ) = unpack q{A8 A20}, $_;

    # Some fields have leading space removed.
    # Generated code has one line like this per affected field.
    $FIELDS::customer_id =~ s/^\s+//;

    # End of generated code.

    # Then we apply business logic to the data ...
    my @fields;
    if ( $FIELDS::transaction_date eq $today ) {
        push @fields, q{something or other};
    }

    # Write to standard format for bulk load to the database.
    print $fh join( '|', @fields ) . qq{\n} or die;
}

Profiling the code reveals that around 35% of the time is spent in the unpack and leading-space strip. The remaining time is spent in validating and transforming the data, and writing to the output file.

It appears that there is no single part of the business logic that takes more than 1-2% of the run time.

The question is - can we eke out a bit more speed from the unpacking and space stripping somehow? Preferably without having to refactor all the code that refers to the FIELDS package variables.

EDIT:

In case it makes a difference

$ perl -v
This is perl, v5.8.0 built for PA-RISC1.1
+1  A: 

Yes. Extracting using substr is likely to be the fastest way to do it. That is:

$FIELDS::transaction_date = substr $_, 0, 8;
$FIELDS::customer_id      = substr $_, 8, 20;

is likely to be faster. Now, if I were handwriting this code, I would not give up unpack, but if you are generating the code, you might as well give it a shot and measure.

See also the answers to Is Perl’s unpack() ever faster than substr()?

As for stripping leading spaces, s/^\s+// is likely to be the fastest method.

Update: It is hard to say anything definite without being able to run benchmarks. However, how about:

my  $x  = substr $_, 0, 8;

for fields that do not need any trimming and

my ($y) = substr($_, 8, 20) =~ /\A\s*(.+?)\s*\z/;

that do need trimming?

Sinan Ünür
I'll experiment and report back.
martin clayton
One thing to note is that unpack 'A' strips trailing spaces 'for free'.
martin clayton
Looks promising so far. Basic benchmarking shows a series of per-field substrs and strip trailing regexes is about 50% faster for us than using one unpack. Tests are passing, but my screen is full of Perlish warnings, so I'm not quite there yet.
martin clayton
@martin clayton: What kind of warnings?
Sinan Ünür
Nothing a judicious defaulting || couldn't take care of. We have some data that doesn't fill the entire record. I'll experiment with the different trailing-space strips, because with none at all the substr version is much quicker than unpack.
martin clayton
.+ is greedy, so you'll leave one space at the end of your string. You want .+? there.
runrig
+6  A: 

I've actually dealt with this problem over and over again. Unpack is better than substr.

As far as stripping spaces goes, you're pretty much screwed. That regex hack is the "official" way to do it. You might be able to gain some efficiency by refining your unpack statements (assuming no data is longer than 4 digits, why unpack the full 12 digits of the field?), but otherwise, the parsing is just a p.i.t.a.

Good luck with your flat data. Fricking legacy junk, how I hates it.

Satanicpuppy
Thanks - pretty much what I thought too. It looks like Sinan's suggestion might be useful for us though.
martin clayton
+3  A: 

Are you sure that you are processor bound on this task? The computation is simple enough that the whole process may well be I/O bound, in which case optimizing for a faster unpack won't gain you much time.

If you are in fact processor bound, the problem as described is pretty much parallelizable, but of course the devil is in the details of your business computation.

ttarchala
Pretty sure, yes. In the real code the I/O and unpacking are separated, unlike in the potted example above. Parallelism is a good call - we do that already. Would still like to spend less time on chopping up strings though.
martin clayton
For me the biggest bottleneck is the database itself; if you're dealing with flat data, you may well be dealing with a horrible ancient flat database. I get around the problem by pulling the data out at intervals to a modern database, but for "real time" data it's almost always the database that's causing the slowdowns.
Satanicpuppy
Have to agree. Overall our process spends most of its time slurping data into the database. We are able to use parallel processes to speed this up. It struck me that since such a large fraction of the 'Perl-time' was spent in such a small part of the code, there might be scope for an easy optimisation here.
martin clayton
+1  A: 

This could also be a job for XS, using a C function to transform the data. I'd imagine it could be a lot faster than anything else, since you can control exactly when data is actually copied.
The build process will get more complicated, though, as it adds a dependency on a C compiler and extra integration steps.

weismat
Thanks. Probably for my project the gain would not justify the extra complexity. As remarked by Satanicpuppy, there are other parts of the process that are more expensive than this. Could work for someone else with a similar problem though.
martin clayton
If most of the CPU time is spent in the regex engine and a builtin function, then most of the time is spent in C land (i.e. not on walking optrees and handling perl data structures) anyway. And it takes a good programmer to beat the people who wrote the perl VM. unpack is fast!
tsee
+1  A: 

Simply make it go in parallel. It's trivial, and on any even remotely modern machine it will be faster.

depesz
Thanks - we use parallelism (see ttarchala's answer above). It's been our most effective optimisation for the code that this is part of.
martin clayton
A: 

The benchmark of a substr-based version of our code suggested that it might be around 50% faster than our existing unpack. Comparing the two versions in place in the actual application, the substr version gave us a 16% reduction in run time. This is close to what we'd hoped for based on the benchmark and the profiling referred to in the question.

This optimisation might be useful for us. But we have a migration to a new OS on the horizon, so we'll wait and see how the two versions perform there before proceeding. We've added a test to keep an eye on the comparative benchmarks.

The idiom we have now is:

$FIELDS::transaction_date = substr( $_, 0, 8 ) || '';
$FIELDS::transaction_date =~ s/\s+\z//;
$FIELDS::customer_id = substr( $_, 8, 20 ) || '';
$FIELDS::customer_id =~ s/\s+\z//;

Followed by selective leading space removal as before.

Thanks for all the answers so far. I'll accept Sinan's because it worked for us, despite appearing to be 'just plain wrong'.

martin clayton