I would like to include data files with a Python package. Is the best place to put them inside the actual package as suggested here, i.e.

setup.py
src/
    mypkg/
        __init__.py
        module.py
        data/
            tables.dat
            spoons.dat
            forks.dat

or is there a better way to do this? What is the best way to retrieve a data file from inside Python? Should I use

mypkg.__path__ + 'data/tables.dat'

for example, or should I use

pkgutil.get_data('mypkg', 'data/tables.dat')

or again, is there another better way to do this?

Generally speaking, what is the current preferred way to deal with data inside Python packages?

A: 

You should store your data as a Python data structure via the pickle module. That way, when you load it, the data is ready to be used, and you don't need to process it in every script.

As for the location, it makes sense to store it in a way that is transparent and clear to the user. The following seems intuitive to me:

from package import data
Arrieta
That doesn't really answer the question, as the pickled data still needs to be in some sort of data file. Also, "should" is a strong word here. You *could* store it as a pickle file, but that makes it hard to edit, for example. Often CSV is better. Another common kind of data to store like this is image files, and there is no reason to make pickles of them.
Lennart Regebro
@Lennart: I think you did not get the point of pickling data: "that way, when you load it, it is ready to use". I presume you understand what this means but, just in case, I'll explain. If you store it as, say, a CSV file, then you need to implement a reader and store each line in a list (let us say you need a list). If you pickle it, then you load the list directly and skip the "create the list" step. As for the "hard to edit" part, well, it is data, right? If you need to edit it, just edit the list directly and rewrite the pickle. Isn't that what data serialization is all about?
Arrieta
Pickling is not totally secure: http://nadiana.com/python-pickle-insecure. Also, it's Python-specific and, like Lennart said, not hand-hackable (which is always a useful thing). You'd be better off using a language-agnostic format like JSON. Also, this answer only talks about data formats rather than data location, which is what the OP wants.
Noufal Ibrahim
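To illustrate the "hand hackable, language agnostic" point in the comment above, here is a minimal sketch of a JSON round trip using only the standard library. The `cutlery` dictionary and its values are made up for the example and do not come from the thread.

```python
import json

# Hypothetical data, invented for this illustration.
cutlery = {"spoons": 12, "forks": 8}

# Serialize to plain, human-editable text.
text = json.dumps(cutlery, indent=2, sort_keys=True)
print(text)

# Loading only builds basic types (dict, list, str, numbers, bool, None);
# nothing in the file is executed.
restored = json.loads(text)
print(restored == cutlery)
```

The resulting text can be edited in any text editor and read from other languages, which is the trade-off the comment is pointing at.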
I presume CSV is not the standard of security.
Arrieta
CSV is more secure because it's purely data: even if your application doesn't trust the file, parsing it is harmless. Pickle files are evaluated on load and can be engineered to execute malicious code upon decoding. That's harder to detect.
Noufal Ibrahim
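The difference described in the comment above can be sketched as follows. The `Payload` class is a hypothetical, deliberately harmless stand-in for a malicious pickle: it only evaluates an arithmetic expression, but the same mechanism could call any importable function.

```python
import csv
import io
import pickle


class Payload:
    """Hypothetical object whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # pickle stores "call eval('40 + 2') to rebuild this object".
        return (eval, ("40 + 2",))


blob = pickle.dumps(Payload())

# Unpickling runs the stored call; the expression is evaluated here.
result = pickle.loads(blob)
print(result)

# The same text in a CSV file is parsed into plain strings and never run.
rows = list(csv.reader(io.StringIO("40 + 2,__import__('os')\n")))
print(rows)
```

Loading the pickle yields the value computed by `eval`, while the CSV reader returns the fields untouched as strings.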
@Arrieta: I understood perfectly, and stand by my original comment as being 100% correct.
Lennart Regebro
CSV and pickle are both (at some level) the same. They're serialisation formats: ways of persisting data. Pickle is an insecure format since it's treated as code. Talk about "being ready to use" etc. is all implementation detail and doesn't contribute to the issue. Also, your answer is not really addressing the question at hand. -1.
Noufal Ibrahim
+3  A: 

pkgutil means you can load the data even if the package is installed in a ZIP file, so it's preferable if you want to support that. Storing it in a data directory like that is fine; I do that all the time. :)

Lennart Regebro
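
As a rough sketch of the ZIP case mentioned in the answer above: the snippet below builds a throwaway archive with the question's `mypkg/data/tables.dat` layout, puts it on `sys.path`, and reads the file with `pkgutil.get_data`. The archive name and the file contents are assumptions made for the demonstration.

```python
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a throwaway ZIP archive mirroring the question's layout.
# The archive name and file contents are made up for this demo.
archive = os.path.join(tempfile.mkdtemp(), "mypkg_demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("mypkg/__init__.py", "")
    zf.writestr("mypkg/data/tables.dat", "oak,4 legs\n")

# Make the archive importable, as if the package were installed zipped.
sys.path.insert(0, archive)

# The resource name is relative to the package and always uses "/".
# get_data returns bytes, or None if the loader cannot supply data.
raw = pkgutil.get_data("mypkg", "data/tables.dat")
print(raw.decode("utf-8"))
```

The same call works unchanged when `mypkg` is an ordinary directory on disk, which is why it is a reasonable default if zipped installs need to be supported.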