I have a set of data points, each described by a dictionary. The processing of each data point is independent and I submit each one as a separate job to a cluster. Each data point has a unique name, and my cluster submission wrapper simply calls a script that takes a data point's name and a file describing all the data points. That script then accesses the data point from the file and performs the computation.

Since each job has to load the set of all points only to retrieve the point to be run, I wanted to optimize this step by serializing the file describing the set of points into an easily retrievable format.
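
Concretely, the per-job access pattern I am trying to optimize looks roughly like this (a simplified sketch; the file layout and process_point are stand-ins for the real code):

import json

def process_point(point):
    # stand-in for the real per-point computation
    pass

def run_one(points_file, point_name):
    # every job reloads the file describing *all* points...
    with open(points_file) as f:
        all_points = json.load(f)
    # ...only to pull out the single entry it was asked to run
    process_point(all_points[point_name])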

I tried jsonpickle, using the following method to serialize a dictionary describing all the data points to a file:

import simplejson

def json_serialize(obj, filename, use_jsonpickle=True):
    f = open(filename, 'w')
    if use_jsonpickle:
        import jsonpickle
        json_obj = jsonpickle.encode(obj)
        f.write(json_obj)
    else:
        simplejson.dump(obj, f, indent=1)
    f.close()
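
The loading side is essentially the mirror image; a minimal sketch of it (jsonpickle.decode undoes encode):

import simplejson

def json_deserialize(filename, use_jsonpickle=True):
    f = open(filename)
    if use_jsonpickle:
        import jsonpickle
        obj = jsonpickle.decode(f.read())
    else:
        obj = simplejson.load(f)
    f.close()
    return obj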

The dictionary contains very simple objects (lists, strings, floats, etc.) and has a total of 54,000 keys. The JSON file is ~20 MB in size.

It takes ~20 seconds to load this file into memory, which seems very slow to me. I switched to using pickle on the exact same object and found that it generates a file that's about 7.8 MB in size and can be loaded in ~1-2 seconds. That is a significant improvement, but it still seems like loading such a small object (fewer than 100,000 entries) should be faster. Aside from that, pickle is not human readable, which was the big advantage of JSON for me.

Is there a way to use JSON to get similar or better speed ups? If not, do you have other ideas on structuring this?

(Is the right solution to simply "slice" the file so that each data point gets its own file, and pass that file to the script that runs the data point as a cluster job? It seems like that could lead to a proliferation of files; a sketch of what I mean is below.)
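
For reference, the slicing approach would look something like this (a sketch; directory and file names are hypothetical), with each job then reading only its own small file:

import os
import json

def slice_points(all_points, out_dir):
    # write one small JSON file per data point, named after the point
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for name, point in all_points.items():
        with open(os.path.join(out_dir, name + '.json'), 'w') as f:
            json.dump(point, f)

def load_point(out_dir, name):
    # each cluster job loads just the one file it needs
    with open(os.path.join(out_dir, name + '.json')) as f:
        return json.load(f)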

thanks.

+1  A: 

I think you are facing a trade-off here: human readability comes at the cost of performance and file size. Thus, of all the serialization methods available in Python, JSON is not only the most readable, it is also the slowest.

If I had to pursue performance (and file compactness), I'd go for marshal. You can either marshal the whole set with dump() and load() (a quick sketch below) or, building on your idea of slicing things up, marshal separate parts of the data set into separate files. This way you open the door to parallelizing the data processing -- if you feel so inclined.
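
A minimal sketch of the whole-set variant, assuming a dict of points like yours (the file name is made up):

import marshal

all_points = {'point_a': [1.0, 2.0], 'point_b': [3.0, 4.0]}  # stand-in for the real 54,000-key dict

# write: marshal the whole dict in one go
with open('points.marshal', 'wb') as f:
    marshal.dump(all_points, f)

# read: each job loads it back (still the whole set, but far faster than JSON)
with open('points.marshal', 'rb') as f:
    all_points = marshal.load(f)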

Of course, there are all kinds of restrictions and warnings in the documentation, so if you decide to play it safe, go for pickle.

kaloyan
JSON is slower than XML? That doesn't seem right...
TM
+3  A: 

marshal is fastest, but pickle per se is not -- maybe you mean cPickle (which is pretty fast, esp. with a -1 protocol). So, apart from readability issues, here's some code to show various possibilities:

import pickle
import cPickle
import marshal
import json

def maked(N=5400):
  d = {}
  for x in range(N):
    k = 'key%d' % x
    v = [x] * 5
    d[k] = v
  return d
d = maked()

def marsh():
  return marshal.dumps(d)

def pick():
  return pickle.dumps(d)

def pick1():
  return pickle.dumps(d, -1)

def cpick():
  return cPickle.dumps(d)

def cpick1():
  return cPickle.dumps(d, -1)

def jso():
  return json.dumps(d)

def rep():
  return repr(d)

and here are their speeds on my laptop:

$ py26 -mtimeit -s'import pik' 'pik.marsh()'
1000 loops, best of 3: 1.56 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick()'
10 loops, best of 3: 173 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick1()'
10 loops, best of 3: 241 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick()'
10 loops, best of 3: 21.8 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick1()'
100 loops, best of 3: 10 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.jso()'
10 loops, best of 3: 138 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.rep()'
100 loops, best of 3: 13.1 msec per loop

so, you can have readability and ten times the speed of json.dumps with repr (though you sacrifice ease of parsing from JavaScript and other languages); you can have the absolute maximum speed with marshal, almost 90 times faster than json; cPickle offers way more generality (in terms of what you can serialize) than either json or marshal, but if you're never going to use that generality then you might as well go for marshal (or repr, if human readability trumps speed).
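
If you go the repr route, the natural way to read the data back in is ast.literal_eval, which safely parses Python literals such as a dict of lists and numbers (a small sketch; the file name is made up):

import ast

d = {'key0': [0, 0, 0, 0, 0], 'key1': [1, 1, 1, 1, 1]}  # stand-in data

# write: repr produces a human-readable Python-literal dump
with open('points.repr', 'w') as f:
    f.write(repr(d))

# read: literal_eval parses literals without executing arbitrary code (unlike eval)
with open('points.repr') as f:
    d2 = ast.literal_eval(f.read())

assert d2 == d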

As for your "slicing" idea, in lieu of a multitude of files, you might want to consider a database (a multitude of records) -- you might even get away without actual serialization if you're running with data that has some recognizable "schema" to it.
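
A minimal sketch of that idea with the standard-library sqlite3 module (table and column names are made up), storing each point under its name so a job fetches only the one row it needs:

import json
import sqlite3

points = {'point_a': [1.0, 2.0], 'point_b': [3.0, 4.0]}  # stand-in data

conn = sqlite3.connect('points.db')
conn.execute('CREATE TABLE IF NOT EXISTS points (name TEXT PRIMARY KEY, data TEXT)')

# write: one row per data point, the value stored as a small JSON blob
conn.executemany('INSERT OR REPLACE INTO points VALUES (?, ?)',
                 [(name, json.dumps(p)) for name, p in points.items()])
conn.commit()

# read: a cluster job pulls only the row it was asked to run
row = conn.execute('SELECT data FROM points WHERE name = ?', ('point_a',)).fetchone()
point = json.loads(row[0])
conn.close()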

Alex Martelli
Thanks so much for your informative reply, that was very helpful. Which databases would you recommend in Python? I would very much prefer things that don't require stand-alone database servers -- or, even better, that are built into Python, maybe like sqlite -- over those that do. Any thoughts on this? Would a database approach in Python rival the pickle times for the test case of a dictionary with ~50,000 keys where you have to slice a particular entry from it? If I switch to a DB, I'll write custom code to serialize into CSV so that my files can be shared and read by other human users.
If you use an embedded DB then sqlite is best, but like any other embedded DB it doesn't buy you any parallel processing, which is the big performance strength of the DB approach in this case. How hard is it to run a PostgreSQL process, after all? And NOW you get perfect parallelization of data access, and a big performance boost. (Writing CSV or other forms to a SQL DB, and dumping the DB's content back to any form of your liking, is an easy job with simple auxiliary scripts, of course -- that's independent of what DB engine you choose.)
Alex Martelli