I am writing some Perl scripts to manipulate large amounts (in total about 42 million rows, but it won't be done in one hit) of data in two PostgreSQL databases.

For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances I'm going to have to use an array of three columns as the unique key.
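
For reference, fetchall_hashref accepts an arrayref of key column names for exactly this case. A minimal sketch, with made-up table and column names:

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details and schema, for illustration only.
    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare(
        'SELECT site_id, sensor_id, reading_date, value FROM readings'
    );
    $sth->execute;

    # Keyed on three columns, giving a nested structure:
    #   $rows->{$site_id}{$sensor_id}{$reading_date} = { ...row hash... }
    my $rows = $sth->fetchall_hashref([qw(site_id sensor_id reading_date)]);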

This has got me wondering about performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything is going into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.

My googling has been unsuccessful, so if anyone can point me in the direction of some general performance studies I'd be grateful.

(I know I could benchmark this myself, but unfortunately for dev purposes I don't have access to a machine with hardware identical to production, which is why I'm looking for general guidelines or even best practices.)

+3  A: 

Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.

My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.
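
A minimal sketch of that pattern (connection details and query are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare('SELECT id, name, value FROM my_table');
    $sth->execute;

    # Bind lexicals to the result columns once, up front; each fetch
    # then just updates the bound variables in place, with no per-row
    # copying into a fresh array or hash.
    my ($id, $name, $value);
    $sth->bind_columns(\$id, \$name, \$value);

    while ($sth->fetch) {    # fetch is an alias for fetchrow_arrayref
        # work with $id, $name, $value here
    }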

ysth
This matches my own understanding.
fennec
...and with the docs. Per http://search.cpan.org/~timb/DBI-1.609/DBI.pm#fetchrow_arrayref "This is the fastest way to fetch data, particularly if used with $sth->bind_columns."
Dave Sherohman
Note that an editor narrowed the focus of the title of this question. It was, to me, ambiguous before then whether the whole question had that narrow focus, and I chose to answer more generally.
ysth
+2  A: 

The first question is whether you really need a fetchall at all. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.

Assuming that fetchall really is needed, my gut feeling is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and there are no hashes to compute for the inserted keys, but any time saved would be dwarfed by database read times, so it's unlikely to be significant.

Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys... That's going to require a good deal more memory to store than the array of arrays returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
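
To make the two shapes concrete, a rough sketch for a hypothetical query (column names invented; you'd use one call or the other, since a statement handle can only be fetched through once):

    # SELECT id, name, value FROM t

    # fetchall_arrayref: arrayref of arrayrefs; field names are not
    # stored per row.
    #   [ [ 1, 'foo', 10 ],
    #     [ 2, 'bar', 20 ] ]
    my $aoa = $sth->fetchall_arrayref;

    # fetchall_hashref('id'): hashref keyed by id; every row is a hash
    # carrying the full set of field-name keys.
    #   { 1 => { id => 1, name => 'foo', value => 10 },
    #     2 => { id => 2, name => 'bar', value => 20 } }
    my $hoh = $sth->fetchall_hashref('id');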

Dave Sherohman
Thanks for this - I'll definitely revisit using fetchall ... and reconsider the hash.
azp74