ansaurus

Question

Python list serialization - fastest method

Answer 1

+2 A:

"how can one benchmark such things reliably?"

I don't get the question.

You write a bunch of little functions to create and save your list in various forms.

You write a bunch of little functions to load your lists in their various forms.

You write a little timer function to get start time, execute the load procedure a few dozen times (to get a solid average that's long enough that OS scheduling noise doesn't dominate your measurements).

You summarize your data in a little report.

What's unreliable about this?

Here are some unrelated questions that shows how to measure and compare performance.

http://stackoverflow.com/questions/489999/python-convert-list-of-ints-to-one-number

http://stackoverflow.com/questions/376461/string-concatenation-vs-string-substitution-in-python

S.Lott 2009-02-17 13:43:57

I agree. That's what I do.

jcoon 2009-02-17 13:48:00

How can I execute "import <filename>" several times in a loop if import is cached?

Eli Bendersky 2009-02-17 16:21:07

If your data set is big enough, one measurement may be all you need. If not, you can run from the command line in a shell loop and time that instead. Also, look at imp.load_module.

S.Lott 2009-02-17 16:37:02

Answer 2

+3 A:

For benchmarking, see the timeit module in the Python standard library. To see what is the fastest way, implement all the ways you can think of and measure them with timeit.

Random thought: depending on what you're doing exactly, you may find it fastest to store "sets of integers" in the style used in .newsrc files:

1, 3-1024, 11000-1200000

If you need to check whether something is in that set, then loading and matching with such a representation should be among the fastest ways. This assumes your sets of integers are reasonably dense, with long consecutive sequences of adjacent values.

Lars Wirzenius 2009-02-17 13:50:14

http://docs.python.org/library/timeit.html

Owen 2009-02-17 13:56:38

Answer 3

+1 A:

To help you with timing, the Python Library provides the timeit module:

This module provides a simple way to time small bits of Python code. It has both command line as well as callable interfaces. It avoids a number of common traps for measuring execution times.

An example (from the manual) that compares the cost of using hasattr() vs. try/except to test for missing and present object attributes:

% timeit.py 'try:' '  str.__nonzero__' 'except AttributeError:' '  pass'
100000 loops, best of 3: 15.7 usec per loop
% timeit.py 'if hasattr(str, "__nonzero__"): pass'
100000 loops, best of 3: 4.26 usec per loop
% timeit.py 'try:' '  int.__nonzero__' 'except AttributeError:' '  pass'
1000000 loops, best of 3: 1.43 usec per loop
% timeit.py 'if hasattr(int, "__nonzero__"): pass'
100000 loops, best of 3: 2.23 usec per loop

gimel 2009-02-17 13:57:14

Answer 4

+4 A:

I would guess cPickle will be fastest if you really need the thing in a list.

If you can use an array, which is a built-in sequence type, I timed this at a quarter of a second for 1 million integers:

from array import array
from datetime import datetime

def WriteInts(theArray,filename):
    f = file(filename,"wb")
    theArray.tofile(f)
    f.close()

def ReadInts(filename):
    d = datetime.utcnow()
    theArray = array('i')
    f = file(filename,"rb")
    try:
        theArray.fromfile(f,1000000000)
    except EOFError:
        pass
    print "Read %d ints in %s" % (len(theArray),datetime.utcnow() - d)
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(0,1000000))
    filename = "a_million_ints.dat"
    WriteInts(a,filename)
    r = ReadInts(filename)
    print "The 5th element is %d" % (r[4])

Carlos A. Ibarra 2009-02-17 14:07:06

'Read 1000000 ints in 0:00:03.500000', and it took 1/4 seconds for you?

Eli Bendersky 2009-02-17 17:43:25

however, you're right, array.fromfile is much faster than cpickle!!

Eli Bendersky 2009-02-17 17:49:45

@eliben - you might want to pick this as the best answer. The lessons on using the timeit module are popular, but they don't answer your question directly!

Greg Ball 2009-02-19 09:14:21

Answer 5

+1 A:

cPickle will be the fastest since it is saved in binary and no real python code has to be parsed.

Other advantates are that it is more secure (since it does not execute commands) and you have no problems with setting $PYTHONPATH correctly.

Johannes Weiß 2009-02-17 14:11:21

Answer 6

+2 A:

Do you need to always load the whole file? If not, upack_from() might be the best solution. Suppose, that you have 1000000 integers, but you'd like to load just the ones from 50000 to 50099, you'd do:

import struct
intSize = struct.calcsize('i') #this value would be constant for a given arch
intFile = open('/your/file.of.integers')
intTuple5K100 = struct.unpack_from('i'*100,intFile,50000*intSize)

vartec 2009-02-24 12:25:20

ansaurus

tags:

views:

answers:

Python list serialization - fastest method

related questions