views: 269

answers: 4

It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (it's 1 GB when stored on disk as a binary pickle file).

Note that the file quickly loads into memory. In other words, if I run:

import cPickle as pickle

f = open("bigNetworkXGraph.pickle","rb")
binary_data = f.read() # This part doesn't take long
graph = pickle.loads(binary_data) # This takes ages

How can I speed this last operation up?

Note that I have tried pickling the data using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ASCII data.

I have 128 GB of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.

A: 

Why don't you use pickle.load?

f = open('fname', 'rb')
graph = pickle.load(f)
SilentGhost
That probably won't help; the reading part is fast enough, and there is enough memory, so unpickling directly from the stream won't gain much.
wump
That's the first thing I tried. I show the more complicated way of loading a pickle file to illustrate that reading the binary data into RAM does not seem to be the bottleneck.
conradlee
A: 

Maybe the best thing you can do is to split the big data into smaller objects, let's say smaller than 50 MB each, so they can be stored in RAM, and then recombine them.

AFAIK there's no way to split the data automatically via the pickle module, so you have to do it yourself.
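
As a rough illustration only, here is one way the manual splitting could look, assuming the graph can be rebuilt from its node and edge lists; the chunk size, file names, and helper functions are hypothetical, not anything built into pickle or NetworkX:

import cPickle as pickle
import networkx as nx

CHUNK = 500000  # edges per piece; arbitrary choice for illustration

def dump_in_pieces(G, prefix):
    # Nodes (with attributes) go in one file...
    with open(prefix + ".nodes.pickle", "wb") as f:
        pickle.dump(list(G.nodes(data=True)), f, 2)
    # ...and the edge list is split into fixed-size chunks.
    edges = list(G.edges(data=True))
    for i in range(0, len(edges), CHUNK):
        with open("%s.edges.%d.pickle" % (prefix, i // CHUNK), "wb") as f:
            pickle.dump(edges[i:i + CHUNK], f, 2)

def load_from_pieces(prefix, n_chunks):
    G = nx.Graph()
    with open(prefix + ".nodes.pickle", "rb") as f:
        G.add_nodes_from(pickle.load(f))
    for k in range(n_chunks):
        with open("%s.edges.%d.pickle" % (prefix, k), "rb") as f:
            G.add_edges_from(pickle.load(f))
    return G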

Anyway, another option (which is quite a bit harder) is to use some NoSQL database like MongoDB to store your data...

Enrico Carlesso
He has 128 GB of RAM, why would he do all the splitting?
SilentGhost
I guess he meant to write 128 MB of RAM...
Enrico Carlesso
No, I mean 128 GB - it's a big machine. I've also got 24 cores to use, so a parallel solution would be nice, although I guess the GIL will not really make this possible.
conradlee
Whoops! BIG machine :) Sorry for the misunderstanding!
Enrico Carlesso
+2  A: 

You're probably bound by Python object creation/allocation overhead, not the unpickling itself. If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
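
To sketch the lazy idea (purely illustrative; LazyAttr is a made-up helper, not part of NetworkX or pickle, and G.node[n] assumes the old NetworkX node-attribute dict):

import cPickle as pickle

class LazyAttr(object):
    # Wraps an already-pickled byte string and unpickles it only on first access.
    def __init__(self, payload):
        self._payload = payload
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            self._value = pickle.loads(self._payload)
            self._loaded = True
        return self._value

# When building the graph for storage, heavy attribute values could be wrapped:
#   G.node[n]["data"] = LazyAttr(pickle.dumps(big_object, 2))
# Unpickling the graph then only creates small LazyAttr shells; the expensive
# objects are materialized on the first .get() call.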

wump
Alright, I can think of some ways to break up this data to some extent (putting node attributes in different files), but the edges of the graph object alone take up a lot of memory. If I have to store these in different files and re-populate my graph every time I use it, then what's the point of serialization? I might as well just store my graph data in an edge list.
conradlee
I indeed don't think serialization is the best solution for your problem. Pickle was never meant to be scalable for huge data structures. This is more the realm of database-like formats that support random access and on-demand loading.
wump
+1  A: 

Why don't you try marshaling your data and storing it in RAM using memcached (for example)? Yes, it has some limitations, but as this points out, marshaling is way faster (20 to 30 times) than pickling.
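
A minimal sketch of the marshal route (keeping in mind that marshal only handles built-in types and its format is not guaranteed stable across Python versions, so the graph has to be reduced to plain dicts first; node attributes would need separate handling):

import marshal
import networkx as nx

def dump_marshal(G, path):
    # marshal only serializes built-in types, so convert the graph to a
    # plain dict-of-dicts first (edge attribute values must themselves be
    # marshal-able: numbers, strings, lists, dicts, ...).
    adj = nx.to_dict_of_dicts(G)
    with open(path, "wb") as f:
        marshal.dump(adj, f)

def load_marshal(path):
    with open(path, "rb") as f:
        adj = marshal.load(f)
    return nx.from_dict_of_dicts(adj)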

Of course, you should also spend as much time as you can optimizing your data structure in order to minimize the amount and complexity of the data you want stored.

kaloyan