views: 1021

answers: 6
Hello,

I need to load (de-serialize) a pre-computed list of integers from a file in a Python script (into a Python list). The list is large (up to millions of items), and I can choose the format I store it in, as long as loading is fastest.

Which is the fastest method, and why?

  1. Using import on a .py file that just contains the list assigned to a variable
  2. Using cPickle's load
  3. Some other method (perhaps numpy?)

Also, how can one benchmark such things reliably?

Addendum: measuring this reliably is difficult, because import is cached and so can't be executed multiple times in a test. Loading with pickle also gets faster after the first run, probably because of page pre-caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec on the first run, and 0.2 sec on subsequent executions of the script.

Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).

And yes, it's important for me that this performs quickly.

Thanks

+2  A: 

"how can one benchmark such things reliably?"

I don't get the question.

You write a bunch of little functions to create and save your list in various forms.

You write a bunch of little functions to load your lists in their various forms.

You write a little timer function that records the start time, executes the load procedure a few dozen times (so the total run is long enough that OS scheduling noise doesn't dominate your measurements), and records the elapsed time.

You summarize your data in a little report.

What's unreliable about this?
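
A minimal sketch of such a harness (the pickle-based loader and the file name ints.pickle are just stand-ins for whichever formats you end up comparing):

import cPickle
import time

def load_pickle(filename):
    f = open(filename, 'rb')
    try:
        return cPickle.load(f)
    finally:
        f.close()

def time_loader(load_func, filename, repetitions=30):
    # Average over many runs so OS scheduling noise doesn't dominate
    start = time.time()
    for _ in xrange(repetitions):
        load_func(filename)
    return (time.time() - start) / repetitions

print "cPickle: %.4f sec per load" % time_loader(load_pickle, "ints.pickle")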

Here are some questions on unrelated topics that show how to measure and compare performance.

http://stackoverflow.com/questions/489999/python-convert-list-of-ints-to-one-number

http://stackoverflow.com/questions/376461/string-concatenation-vs-string-substitution-in-python

S.Lott
I agree. That's what I do.
jcoon
How can I execute "import <filename>" several times in a loop if import is cached?
Eli Bendersky
If your data set is big enough, one measurement may be all you need. If not, you can run from the command line in a shell loop and time that instead. Also, look at imp.load_module.
S.Lott
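
One way around the import cache in such a measurement is to drop the cached module and re-import it explicitly; a sketch using imp (here 'biglist' is a made-up name for whatever module holds the list):

import imp
import sys
import timeit

def import_fresh(module_name):
    # Discard any cached copy so the full import cost is paid every time
    sys.modules.pop(module_name, None)
    f, pathname, description = imp.find_module(module_name)
    try:
        return imp.load_module(module_name, f, pathname, description)
    finally:
        if f:
            f.close()

print timeit.timeit(lambda: import_fresh('biglist'), number=10)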
+3  A: 

For benchmarking, see the timeit module in the Python standard library. To see what is the fastest way, implement all the ways you can think of and measure them with timeit.

Random thought: depending on what you're doing exactly, you may find it fastest to store "sets of integers" in the style used in .newsrc files:

1, 3-1024, 11000-1200000

If you need to check whether something is in that set, then loading and matching with such a representation should be among the fastest ways. This assumes your sets of integers are reasonably dense, with long consecutive sequences of adjacent values.
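
A rough sketch of parsing such a representation and testing membership (the helper names here are invented for illustration):

def parse_ranges(text):
    # "1, 3-1024, 11000-1200000" -> [(1, 1), (3, 1024), (11000, 1200000)]
    ranges = []
    for part in text.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            ranges.append((int(lo), int(hi)))
        else:
            ranges.append((int(part), int(part)))
    return ranges

def contains(ranges, value):
    # Linear scan; bisect on the sorted start values would scale better
    return any(lo <= value <= hi for lo, hi in ranges)

ranges = parse_ranges("1, 3-1024, 11000-1200000")
print contains(ranges, 500)     # True
print contains(ranges, 2000)    # False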

Lars Wirzenius
http://docs.python.org/library/timeit.html
Owen
+1  A: 

To help you with timing, the Python Library provides the timeit module:

This module provides a simple way to time small bits of Python code. It has both command line as well as callable interfaces. It avoids a number of common traps for measuring execution times.

An example (from the manual) that compares the cost of using hasattr() vs. try/except to test for missing and present object attributes:

% timeit.py 'try:' '  str.__nonzero__' 'except AttributeError:' '  pass'
100000 loops, best of 3: 15.7 usec per loop
% timeit.py 'if hasattr(str, "__nonzero__"): pass'
100000 loops, best of 3: 4.26 usec per loop
% timeit.py 'try:' '  int.__nonzero__' 'except AttributeError:' '  pass'
1000000 loops, best of 3: 1.43 usec per loop
% timeit.py 'if hasattr(int, "__nonzero__"): pass'
100000 loops, best of 3: 2.23 usec per loop
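
The same module can also be driven from inside a script, which is closer to the original question; a minimal sketch, assuming the list has already been pickled to a hypothetical file ints.pickle:

import timeit

setup = "import cPickle"
stmt = "f = open('ints.pickle', 'rb'); cPickle.load(f); f.close()"
# timeit.repeat() runs the statement in several batches; take the best batch,
# and note that the OS file cache makes every run after the first one faster.
best = min(timeit.repeat(stmt, setup, repeat=3, number=10))
print "%.4f sec per load" % (best / 10)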
gimel
+4  A: 

I would guess cPickle will be fastest if you really need the thing in a list.

If you can use an array (the array module's compact sequence type) instead of a list, I timed this at a quarter of a second for 1 million integers:

from array import array
from datetime import datetime

def WriteInts(theArray,filename):
    f = file(filename,"wb")
    theArray.tofile(f)
    f.close()

def ReadInts(filename):
    d = datetime.utcnow()
    theArray = array('i')
    f = file(filename,"rb")
    try:
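        # Ask for far more items than the file holds; fromfile() reads until
        # EOF and then raises EOFError, which is swallowed below.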
        theArray.fromfile(f,1000000000)
    except EOFError:
        pass
    print "Read %d ints in %s" % (len(theArray),datetime.utcnow() - d)
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(0,1000000))
    filename = "a_million_ints.dat"
    WriteInts(a,filename)
    r = ReadInts(filename)
    print "The 5th element is %d" % (r[4])
Carlos A. Ibarra
'Read 1000000 ints in 0:00:03.500000', and it took a quarter of a second for you?
Eli Bendersky
However, you're right: array.fromfile is much faster than cPickle!
Eli Bendersky
@eliben - you might want to pick this as the best answer. The lessons on using the timeit module are popular, but they don't answer your question directly!
Greg Ball
+1  A: 

cPickle will be the fastest since the data is saved in binary form and no real Python code has to be parsed.

Other advantages are that it is safer than importing a module (loading a pickle does not execute arbitrary module-level code, though unpickling untrusted data is still not secure) and that you have no problems with setting $PYTHONPATH correctly.
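
A minimal sketch of saving and loading this way (the filename is only an example); passing cPickle.HIGHEST_PROTOCOL matters because the default protocol 0 is text-based and noticeably slower:

import cPickle

def save_list(data, filename):
    f = open(filename, 'wb')
    try:
        # Use the binary protocol; the default (protocol 0) is ASCII and slower
        cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)
    finally:
        f.close()

def load_list(filename):
    f = open(filename, 'rb')
    try:
        return cPickle.load(f)
    finally:
        f.close()

save_list(range(1000000), 'a_million_ints.pkl')
numbers = load_list('a_million_ints.pkl')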

Johannes Weiß
+2  A: 

Do you always need to load the whole file? If not, struct.unpack_from() might be the best solution. Suppose you have 1,000,000 integers but you'd like to load just the ones from 50000 to 50099; since unpack_from() wants a buffer rather than a file object, you can memory-map the file (this assumes it holds raw machine-format ints, e.g. as written with array.tofile):

import mmap, struct

intSize = struct.calcsize('i') # this value is constant for a given arch
intFile = open('/your/file.of.integers', 'rb')
buf = mmap.mmap(intFile.fileno(), 0, access=mmap.ACCESS_READ)
intTuple5K100 = struct.unpack_from('i'*100, buf, 50000*intSize)
vartec