I have a number of scripts that currently read in a lot of data from some .CSV files. For efficiency, I use the Text::CSV_XS module to read them in and then create a hash using one of the columns as an index. However, I have a lot of files and they are quite large. And each of the scripts needs to read in the data all over again.
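
Roughly, each loader does something like this (a sketch only; the file name and the key column below are placeholders):

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1 });
    open my $fh, '<', 'data.csv' or die "data.csv: $!";

    my %index;
    while (my $line = <$fh>) {
        chomp $line;
        $csv->parse($line) or die "Bad CSV line: $line";
        my @fields = $csv->fields;
        $index{ $fields[0] } = \@fields;   # key the hash on the first column
    }
    close $fh;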

The question is: How can I have persistent storage of these Perl hashes so that all of them can be read back in with a minimum of CPU?

Combining the scripts is not an option. I wish...

I applied the 2nd rule of optimization and used profiling to find that the vast majority of the CPU time (about 90%) was spent in:

Text::CSV_XS::fields
Text::CSV_XS::Parse
Text::CSV_XS::parse

So, I made a test script that read in all the .CSV files with Text::CSV_XS, dumped the resulting hashes with the Storable module, and then read them back in with Storable. I profiled this so I could see the CPU times:

$ c:/perl/bin/dprofpp.bat
Total Elapsed Time = 1809.397 Seconds
  User+System Time = 950.5560 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 25.6   243.6 243.66    126   1.9338 1.9338  Storable::pretrieve
 20.5   194.9 194.92 893448   0.0002 0.0002  Text::CSV_XS::fields
 9.49   90.19 90.198 893448   0.0001 0.0001  Text::CSV_XS::Parse
 7.48   71.07 71.072    126   0.5641 0.5641  Storable::pstore
 4.45   42.32 132.52 893448   0.0000 0.0001  Text::CSV_XS::parse
 (the rest was all 0.07% or less and can be ignored)

So, loading the data back in with Storable takes about 25.6% of the CPU time, compared to about 35% for parsing the CSV files with Text::CSV_XS. Not a lot of savings...
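
For reference, the Storable round trip in the test boils down to this (a sketch; the file name and sample data are placeholders):

    use Storable qw(store retrieve);

    my %index = ( some_key => [ 'some', 'fields' ] );   # stand-in for the parsed CSV hash

    # Dump the hash to disk once...
    store \%index, 'data.storable' or die "store failed";

    # ...then later read it back instead of re-parsing the CSV.
    my $index_ref = retrieve 'data.storable';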

Has anybody got a suggestion on how I can read in these data more efficiently?

Thanks for your help.

+9  A: 

Parse the data once and put it in an SQLite db. Query using DBI.
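
In outline, that could look something like this (a sketch only; the table, column, and file names are made up for illustration):

    use DBI;
    use Text::CSV_XS;

    # One-time import: parse each CSV once and load it into an SQLite file.
    my $dbh = DBI->connect('dbi:SQLite:dbname=data.sqlite', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)');

    my $csv = Text::CSV_XS->new({ binary => 1 });
    open my $fh, '<', 'data.csv' or die $!;
    my $ins = $dbh->prepare('INSERT INTO records (id, value) VALUES (?, ?)');
    while (my $row = $csv->getline($fh)) {
        $ins->execute(@$row[0, 1]);
    }
    $dbh->commit;

    # Every later script queries just what it needs instead of re-parsing.
    my ($value) = $dbh->selectrow_array(
        'SELECT value FROM records WHERE id = ?', undef, 'some_key');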

Sinan Ünür
Yours and friedo's get the thumbs up from me.
Axeman
@Axeman Thank you.
Sinan Ünür
That's the way to go if you don't need write access.
brian d foy
Thanks for the suggestion; this is the way I went. Test results are posted in a separate answer.
Harold Bamford
+2  A: 

It's vastly preferable not to pull the entire list into memory every time you run the script; using an on-disk database will let you avoid that. If, for some reason, you have to touch each entry in the CSV file every time you run, I might recommend storing it on a RAM disk instead of a physical disk. Since it obviously fits in memory, I don't think you'll get much improvement by changing the on-disk format you store it in. The only way to really speed it up is to store it on a faster medium.

bmdhacks
+11  A: 

The easiest way to put a very large hash on disk, IMHO, is with BerkeleyDB. It's fast, time-tested and rock-solid, and the CPAN module provides a tied API. That means you can continue using your hash as if it were an in-memory data structure, but it will automatically read and write through BerkeleyDB to disk.
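
The tied interface looks roughly like this (a sketch; the file name is a placeholder, and nested values would still need to be serialized, e.g. with Storable's freeze/thaw):

    use BerkeleyDB;

    # Tie an ordinary Perl hash to an on-disk BerkeleyDB file.
    tie my %index, 'BerkeleyDB::Hash',
        -Filename => 'data.bdb',
        -Flags    => DB_CREATE
        or die "Cannot open data.bdb: $BerkeleyDB::Error";

    # Reads and writes now go through to disk transparently.
    $index{some_key} = 'some value';
    print $index{some_key}, "\n";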

friedo
+1  A: 

If you only need to access part of the data in each script, rather than ALL of it, DBM::Deep is probably your best bet.
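
For illustration, using it looks roughly like this (the file name and keys are placeholders):

    use DBM::Deep;

    # The whole structure lives on disk; only the keys you touch are read.
    my $db = DBM::Deep->new('data.db');

    # Nested structures are fine...
    $db->{some_key} = { name => 'example', count => 42 };

    # ...and later scripts fetch individual entries without loading everything.
    print $db->{some_key}{name}, "\n";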

Disk I/O is likely to be your biggest bottleneck no matter what you do. Perhaps you could use a data provider that keeps all the data available in an mmapped cache, using something like Sys::Mmap::Simple. I've never needed to do this sort of thing, though, so I don't have much else to offer.

daotoad
Explain *why* DBM::Deep would be better for accessing part of the data, please?
ysth
DBM::Deep is a beautiful module: Think of it literally storing Perl data structures on disk without the need to deserialize the *whole* DB as with Storable. That being said, it's very, very slow if you need any significant fraction of the data. It puts convenience over performance.
tsee
+3  A: 

Well, I've taken the suggestion of Sinan Ünür (thanks!), made an SQLite database, and re-run my test program to compare getting the data via the CSV files with getting it out of the SQLite database:

$ c:/perl/bin/dprofpp.bat
Total Elapsed Time = 1705.947 Seconds
  User+System Time = 1084.296 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.5   212.2 212.26 893448   0.0002 0.0002  Text::CSV_XS::fields
 15.7   170.7 224.45    126   1.3549 1.7814  DBD::_::st::fetchall_hashref
 9.14   99.15 99.157 893448   0.0001 0.0001  Text::CSV_XS::Parse
 6.03   65.34 164.49 893448   0.0001 0.0002  Text::CSV_XS::parse
 4.93   53.41 53.412 893574   0.0001 0.0001  DBI::st::fetch
   [ *removed the items of less than 0.01 percent* ]

The total for CSV_XS is 34.67%, as compared to 20.63% for SQLite, which is somewhat better than the Storable solution I tried before. However, this isn't a fair comparison, since with the CSV_XS solution I have to load the entire CSV file, but with the SQLite interface I can load just the parts I want. Thus, in practice, I expect even more improvement than this simple-minded test shows.
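
For anyone curious, the fetch side is essentially this (a sketch; the table and key-column names are placeholders):

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=data.sqlite', '', '',
                           { RaiseError => 1 });

    # Fetch only the rows this script needs, keyed by the index column,
    # which yields a hash-of-hashes much like the old in-memory structure.
    my $sth = $dbh->prepare('SELECT * FROM records WHERE id LIKE ?');
    $sth->execute('A%');
    my $index = $sth->fetchall_hashref('id');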

I have not tried using BerkeleyDB (sorry, friedo) instead of SQLite, mostly because I didn't see that suggestion until I was well involved with trying out SQLite. Setting up the test was a non-trivial task since I almost never have occasion to use SQL databases.

Still, the solution is clearly to load all the data into a database and access it via the DBI module. Thanks for everyone's help. All responses are greatly appreciated.

Harold Bamford
@Harold Thanks for accepting my answer, but, more importantly, thank you very much for a nice summary with actual numbers.
Sinan Ünür