ansaurus

Question

How can I speed up Perl's processing of fixed-width data?

Answer 1

+1 A:

Yes. Extracting using substr is likely to be the fastest way to do it. That is:

$FIELDS::transaction_date = substr $_, 0, 8;
$FIELDS::customer_id      = substr $_, 8, 20;

is likely to be faster. Now, if I were handwriting this code, I would not give up unpack, but if you are generating the code, you might as well give it a shot and measure.

See also the answers to Is Perl’s unpack() ever faster than substr()?

As for stripping leading spaces, s/^\s+// is likely to be the fastest method.

Update: It is hard to say anything definite without being able to run benchmarks. However, how about:

my  $x  = substr $_, 0, 8;

for fields that do not need any trimming and

my ($y) = substr($_, 8, 20) =~ /\A\s+(.+?)\s+\z/;

that do need trimming?

Sinan Ünür 2009-10-19 13:56:58

I'll experiment and report back.

martin clayton 2009-10-19 14:14:01

One thing to note is that unpack 'A' strips trailing spaces 'for free'.

martin clayton 2009-10-19 15:19:28

Looks promising so far. Basic benchmarking shows a series of per-field substrs and strip trailing regexes is about 50% faster for us than using one unpack. Tests are passing, but my screen is full of Perlish warnings, so I'm not quite there yet.

martin clayton 2009-10-19 15:30:50

@martin clayton: What kind of warnings?

Sinan Ünür 2009-10-19 15:35:42

Nothing a judicial defaulting || couldn't take care of. We have some data that doesn't fill the entire record. I'll experiment with the different trailing space strips, because with none the substr version is much quicker than unpack.

martin clayton 2009-10-19 15:54:30

.+ is greedy, so you'll leave one space at the end of your string. You want .+? there.

runrig 2009-10-22 15:48:39

Answer 2

+6 A:

I've actually dealt with this problem over and over again. Unpack is better than substr.

As far as stripping spaces goes, you're pretty much screwed. That regex hack is the "official" way to do it. You might be able to gain some efficiency by refining your unpack statements (assuming no data is longer than 4 digits, why unpack the full 12 digits of the field?), but otherwise, the parsing is just a p.i.t.a.

Good luck with your flat data. Fricking legacy junk, how I hates it.

Satanicpuppy 2009-10-19 14:13:29

Thanks - pretty much what I thought too. It looks like Sinan's suggestion might be useful for us though.

martin clayton 2009-10-19 15:21:32

Answer 3

+3 A:

Are you sure that you are processor bound on this task? The computation is simple enough to suspect that the whole process is likely to be I/O bound. In which case optimizing for faster unpack won't gain you much time.

In case you're in fact processor bound, the problem as described seems pretty much parallelizable, but of course the devil is in the details of your business computation.

ttarchala 2009-10-19 14:22:12

Pretty sure, yes. In the real code the I/O and unpacking are separated, unlike in the potted example above. Parallelism is a good call - we do that already. Would still like to spend less time on chopping up strings though.

martin clayton 2009-10-19 14:38:24

For me the biggest bottleneck is the database itself; if you're dealing with flat data, you may well be dealing with a horrible ancient flat database. I get around the problem by pulling the data out at intervals to a modern database, but for "real time" data it's almost always the database that's causing the slowdowns.

Satanicpuppy 2009-10-19 16:06:27

Have to agree. Overall our process is spending most of the time slurping data into the database. We are able to use parallel processes to speed this up. It struck me that as such a large fraction of the 'Perl-time' was spent in such a small part of the code there might be scope for an easy optimise here.

martin clayton 2009-10-19 21:40:50

Answer 4

+1 A:

This could be also something for XS - so that you use a C function to change the data. I could imagine that this is a lot faster than anything else as you can manually control when data is really copied.
The build process will get more difficult as you have a dependency to the C compiler and will require additional integration steps.

weismat 2009-10-19 16:20:05

Thanks. Probably for my project the gain would not justify the extra complexity. As remarked by Satanicpuppy, there are other parts of the process that are more expensive than this. Could work for someone else with a similar problem though.

martin clayton 2009-10-19 21:34:51

If most of the CPU time is spent in the regex engine and a builtin function, then most of the time is spent in C land (i.e. not on walking optrees and handling perl data structures) anyway. And it takes a good programmer to beat the people who wrote the perl VM. unpack is fast!

tsee 2009-10-19 23:27:37

Answer 5

+1 A:

Simply make it go in parallel. It's trivial, and on any even remotely modern machine it will be faster.

depesz 2009-10-20 07:33:13

Thanks - we use parallelism (see ttarchala's answer above). It's been our most effective optimisation for the code that this is part of.

martin clayton 2009-10-20 13:20:50

Answer 6

A:

The benchmark of a substr-based version of our code suggested that it might be around 50% faster than our existing unpack. Comparing the codes in place in the actual application, the substr version gave us a 16% reduction in run time. This is close to what we'd hoped for based on the benchmark and the profiling referred to in the question.

This optimisation might be useful for us. But, we have a migration to a new OS on the horizon, so we'll wait and see how the codes perform there before proceeding. We've added a test to keep an eye on the comparative benchmarks.

The idiom we have now is:

$FIELDS::transaction_date = substr( $_, 0, 8 ) || '';
$FIELDS::transaction_date =~ s/\s+\z//;
$FIELDS::customer_id = substr( $_, 8, 20 ) || '';
$FIELDS::customer_id =~ s/\s+\z//;

Followed by selective leading space removal as before.

Thanks for all the answers so far. I'll accept Sinan's because it worked for us, despite appearing to be 'just plain wrong'.

martin clayton 2009-10-20 14:39:44

ansaurus

tags:

views:

answers:

How can I speed up Perl's processing of fixed-width data?

related questions