I'm considering creating a persistent storage layer, something like a DBMS engine. What would be the benefits of creating a custom binary format over directly cPickling the objects and/or using the shelve module?

+2  A: 

Note that not all objects may be directly pickled - only basic types, or objects that have defined the pickle protocol.
Using your own binary format would allow you to potentially store any kind of object.
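
As a rough illustration of what "defining the pickle protocol" means: most plain classes pickle without any extra code, and the hooks are only needed when an instance holds something pickle cannot handle. The sketch below assumes a made-up `Counter` class that owns a lock (locks are not picklable) and uses `__getstate__`/`__setstate__` to leave it out.

```python
import pickle
import threading


class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain data and create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


restored = pickle.loads(pickle.dumps(Counter(5)))
print(restored.value)
```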

As a side note, the Zope Object Database (ZODB) follows that very same approach, storing objects in the pickle format. You may be interested in looking at its implementation.

Roberto Liffredo
pickle can handle most user-defined classes without extra code. You only have to define special handling via the pickle protocol in some cases.
Nelson
+8  A: 

Pickling is a two-sided coin.

On one side, you have a way to store your object in a very easy way. Just four lines of code and you pickle. You have the object exactly as it is.

On the other side, it can become a compatibility nightmare. You cannot unpickle objects unless their classes are defined in your code exactly as they were when pickled. This strongly limits your ability to refactor the code or rearrange things in your modules. Also, not everything can be pickled, and if you are not strict about what gets pickled and the client of your code has full freedom to include any object, sooner or later it will pass something unpicklable to your system, and the system will go boom.
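
To make both problems concrete, here is a small sketch (the `Point` class is just an assumed example). It shows that the pickle only records the class by module and name, which is why renaming or moving the class breaks old data, and that some objects, such as lambdas, are rejected outright.

```python
import pickle


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


data = pickle.dumps(Point(1, 2))

# The pickle stores a reference to the class by name, not the class itself,
# so loading this data later requires a class still reachable under that name.
class_name_in_pickle = b"Point" in data

# Objects such as lambdas cannot be pickled at all.
try:
    pickle.dumps(lambda x: x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False

print(class_name_in_pickle, lambda_pickled)
```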

Be very careful about its use. There's no better definition of quick and dirty.

Stefano Borini
+1 for the issue regarding refactoring
Roberto Liffredo
+1  A: 

The potential advantages of a custom format over a pickle are:

  • you can selectively get individual objects, rather than having to instantiate the full set of objects
  • you can query subsets of objects by properties, and only load those objects that match your criteria

Whether these advantages materialize depends on how you design the storage, of course.
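
A minimal sketch of the idea, assuming a made-up fixed-size record layout (an int32 id and a float64 score) written with the `struct` module. Because every record has the same size, one record can be fetched with a single seek. The `scan` helper still decodes every record to filter them; a real design would probably add an index to avoid that.

```python
import io
import struct

# Hypothetical record layout: little-endian int32 id, float64 score.
RECORD = struct.Struct("<id")


def write_records(stream, records):
    for rec_id, score in records:
        stream.write(RECORD.pack(rec_id, score))


def read_record(stream, index):
    # Jump straight to the requested record; nothing else is read.
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


def scan(stream, predicate):
    # Walk the records one at a time and yield only those that match.
    stream.seek(0)
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        record = RECORD.unpack(chunk)
        if predicate(record):
            yield record


store = io.BytesIO()
write_records(store, [(1, 0.5), (2, 0.9), (3, 0.1)])
second = read_record(store, 1)
good = list(scan(store, lambda rec: rec[1] > 0.4))
```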

Martin v. Löwis
+2  A: 

One reason to define your own custom binary format could be optimization. pickle (and shelve, which uses pickle) is a generic serialization framework; it can store almost any Python data. It's easy to use pickle in a lot of situations, but it takes time to inspect all the objects and serialize their data and the data itself is stored in a generic, verbose format. If you are storing specific known data a custom-built serializer can be both faster and more concise.

It takes 37 bytes to pickle an object with a single integer value:

>>> import pickle
>>> class Foo: pass
...
>>> foo = Foo()
>>> foo.x = 3
>>> print repr(pickle.dumps(foo))
"(i__main__\nFoo\np0\n(dp1\nS'x'\np2\nI3\nsb."

Embedded in that data is the name of the property and its type. A custom serializer for Foo (and Foo alone) could dispense with that and just store the number, saving both time and space.
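
For comparison, a sketch of such a Foo-only serializer using the `struct` module. The helper names `dump_foo` and `load_foo` are made up, and the sketch assumes `x` always fits in a signed 32-bit integer. The exact size of the pickle varies with the Python version and pickle protocol, so only the custom size is fixed here.

```python
import pickle
import struct


class Foo:
    pass


def dump_foo(foo):
    # Store only the value of x as a 4-byte little-endian signed integer.
    return struct.pack("<i", foo.x)


def load_foo(data):
    # Rebuild a Foo from the 4 bytes written by dump_foo.
    foo = Foo()
    (foo.x,) = struct.unpack("<i", data)
    return foo


foo = Foo()
foo.x = 3
custom = dump_foo(foo)
generic = pickle.dumps(foo)
print(len(custom), len(generic))
```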

Another reason for a custom serialization framework is that you can easily do custom validation and versioning of data. If you change your object types and need to load an old version of the data, it can be tricky via pickle. Your own code can easily be customized to handle older data formats.
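
One way this could look, as a sketch only: a made-up format with a `FOO` signature and a one-byte version number in front of the payload. The layout, the two versions and the default for the missing field are all assumptions for illustration.

```python
import struct

MAGIC = b"FOO"  # hypothetical file signature


def dump_v2(x, y):
    # Version 2 layout: signature, version byte, then two int32 fields.
    return MAGIC + struct.pack("<Bii", 2, x, y)


def load(data):
    # Validate the signature before trusting anything else.
    if data[:3] != MAGIC:
        raise ValueError("not a FOO record")
    version = data[3]
    body = data[4:]
    if version == 1:
        # Old layout had only x; fill in a default for the newer field.
        (x,) = struct.unpack("<i", body)
        return {"x": x, "y": 0}
    if version == 2:
        x, y = struct.unpack("<ii", body)
        return {"x": x, "y": y}
    raise ValueError("unsupported version: %d" % version)


old_record = MAGIC + struct.pack("<Bi", 1, 7)
new_record = dump_v2(7, 9)
```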

In practice, I'd build something using the generic cPickle module and only replace it if profiling indicated it was really important. Maintaining a separate serialization framework is a significant amount of work.

One final resource you may find useful: some synthetic serializer benchmarks. cPickle is pretty fast.

Nelson
+1  A: 

If you are going to do that (implement your own binary format), you should first know that Python has a good library for handling HDF5, a binary format used in physics and astronomy to dump huge amounts of data.

This is the home page of the library:

Basically, you can think of HDF5 as a hierarchical database in which a table column can itself contain an inner table: the table Populations has a column called Individual, which is a table containing the information about every individual, and so on.

PyTables also has its own implementation of the cPickle module; you can access it with:

$ easy_install tables
$ python
>>> import tables
>>> tables.cPickle

I have never used PyTables' pickle, but I think it should be straightforward to learn how it works, so you may want to have a look at it before implementing your own format.

dalloliogm
A: 

Will you ever need to process data from untrusted sources? If so, you should know that the pickle format is actually a virtual machine that is capable of executing arbitrary code on behalf of the process doing the unpickling.
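
A harmless demonstration of the problem, with a made-up `Payload` class. Its `__reduce__` method tells pickle which callable to invoke at load time; here it is `eval` on a trivial expression, but an attacker could name any importable callable.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        return (eval, ("6 * 7",))


malicious = pickle.dumps(Payload())

# Loading runs the embedded call; no Payload object is ever created.
result = pickle.loads(malicious)
print(result)  # prints 42
```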

Marius Gedminas
+1  A: 

See this solution at SourceForge:

y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

[The commentary included with the source endnotes discusses why pickle was selected over json.]

code43
If it uses pickle, it's not safe for a web-based project, is it?
phmr
y_serial only unpickles trusted pickles created by its own functions, so it's safe. You should read the endnotes in the module itself, which give a detailed explanation.
code43