I'm unpickling a NetworkX object that's about 1 GB on disk. Although I saved it in a binary format (using protocol 2), it takes a very long time to unpickle this file: at least half an hour. The system I'm running on has plenty of memory (128 GB), so that's not the bottleneck.

I've read here that unpickling can be sped up by first reading the entire file into memory and then unpickling it (that particular thread refers to Python 3.0, which I'm not using, but the point should still hold in Python 2.6).

How do I first read the binary file, and then unpickle it? I have tried:

import cPickle as pickle
f = open("big_networkx_graph.pickle","rb")
bin_data = f.read()
graph_data = pickle.load(bin_data)

But this returns:

TypeError: argument must have 'read' and 'readline' attributes

Any ideas?

+1  A: 

pickle.load(file) expects a file-like object. Instead, use:

pickle.loads(string)

Read a pickled object hierarchy from a string. Characters in the string past the pickled object’s representation are ignored.
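Applied to the code in the question, this could look like the following minimal sketch. The helper name load_pickle_from_memory is made up for illustration, and the try/except import is an assumption added so the same snippet runs on both Python 2 (cPickle) and Python 3 (pickle):

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: no separate cPickle module


def load_pickle_from_memory(path):
    """Read the whole file into memory first, then unpickle the raw bytes.

    Hypothetical helper for illustration; it is not part of pickle.
    """
    with open(path, "rb") as f:
        bin_data = f.read()  # a single read of the entire file
    # loads() takes the in-memory data directly, unlike load(),
    # which needs an object with read() and readline() methods.
    return pickle.loads(bin_data)
```

Usage would then be graph_data = load_pickle_from_memory("big_networkx_graph.pickle"). Whether this is actually faster than pickle.load on an open file depends on the setup, so it is worth timing both.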

gimel
That appears to be working. I didn't think it would work because "loads" means "load string," whereas the data I am loading is binary. But because I read the file in binary mode, the string I feed to it is also binary, so it all works out. Thanks.
conradlee
@conradlee: Python strings support binary data, so there's no need to make that distinction.
unwind
A: 

The documentation mentions StringIO, which I think is one possible solution.

Try:

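import cPickle as pickle
from StringIO import StringIO  # Python 2 module
f = open("big_networkx_graph.pickle", "rb")
bin_data = f.read()
f.close()
sio = StringIO(bin_data)  # wrap the in-memory data in a file-like object
graph_data = pickle.load(sio)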
unwind
This works, but because it requires importing StringIO, it's more complicated than gimel's solution. That's why I'm giving him credit with the answer.
conradlee