What is the best way to store a large amount of data in Python, given one (or two) dictionaries of 500,000+ items used for undirected graph searching?

I've been considering a few options such as storing the data as XML:

<key name="a">
    <value data="1" />
    <value data="2" />
</key>
<key name="b">
...

or in a python file for direct access:

db = {"a": [1, 2], "b": ...}

or in a SQL database? I'm thinking this would be the best solution, but would I have to rely more on SQL to do the computation than on Python itself?

+2  A: 

I would consider using one of the many graph libraries available for Python (e.g. python-graph).
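For illustration, here is a minimal sketch of the idea using networkx, a different but widely used Python graph library (python-graph's own API differs), building an undirected graph from an adjacency dict like the one in the question:

    import networkx as nx

    # Adjacency dict in the question's format.
    adjacency = {"a": [1, 2], "b": [2, 3]}

    # Build an undirected graph from it.
    G = nx.Graph()
    for node, neighbours in adjacency.items():
        for neighbour in neighbours:
            G.add_edge(node, neighbour)

    # The library then supplies the search algorithms,
    # e.g. shortest path between two nodes.
    print(nx.shortest_path(G, "a", "b"))  # ['a', 2, 'b']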

mac
+1  A: 

You need to specify your problem a bit better. I'll make a few assumptions: 1) your data is static and you just want to search it, 2) you have enough memory to store it.

If application startup speed is not critical, the data format is up to you, just as long as you can get it into Python memory. Use simple data types (dicts, lists, strings) to store data, not an XML graph, if you want to access it quickly. You might consider writing a lightweight class of your own to express nodes and store links to other nodes in a dict or array.
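As a sketch of such a class (the names here are purely illustrative):

    class Node:
        # __slots__ keeps per-instance memory low, which matters
        # with 500,000+ nodes.
        __slots__ = ("name", "neighbours")

        def __init__(self, name):
            self.name = name
            self.neighbours = []  # direct references to other Node objects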

If application startup time is critical, consider loading your data in a Python program and pickling it out to a file; you can then load the pickled data structure (which should be really fast) in the production application.
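A minimal sketch of that workflow (the file name is arbitrary):

    import pickle

    db = {"a": [1, 2], "b": [2, 3]}  # the big adjacency dict

    # One-off preparation step: serialize the structure to disk.
    with open("graph.pickle", "wb") as f:
        pickle.dump(db, f, pickle.HIGHEST_PROTOCOL)

    # In the production application: loading this back is much
    # faster than re-parsing a text format.
    with open("graph.pickle", "rb") as f:
        db = pickle.load(f)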

If, on the other hand, your data is too big to fit in memory, or you want to be able to modify it persistently, you could use SQL for storage (either an external server or an SQLite database) or ZODB (a Python object database).
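For the SQLite option, a sketch of one possible edge-table layout (the schema is an assumption, not a prescription), using the standard-library sqlite3 module:

    import sqlite3

    conn = sqlite3.connect("graph.db")
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)",
                     [("a", "1"), ("a", "2"), ("b", "2")])
    conn.commit()

    # Neighbour lookup for one node; the graph is undirected,
    # so check both edge directions.
    rows = conn.execute(
        "SELECT dst FROM edges WHERE src = ? "
        "UNION SELECT src FROM edges WHERE dst = ?",
        ("a", "a"))
    print([r[0] for r in rows])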

Alex Morega
A: 

If you store your data in an XML file it will be easier to modify by hand (e.g. in a text editor), but you must take into account that reading and parsing that amount of data from an XML file is heavy work. Using a SQL database (maybe PostgreSQL) would make that choice more performant: a DBMS is more optimized than reading and parsing files from the filesystem yourself. If you store all your data as a Python structure in a separate module, you also get the advantage of bytecode compilation (.pyc), which gives no boost in computational terms but allows for a faster load (which is what you want). I would choose the last one.
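A sketch of that last option (the module name is illustrative): keep the literal in its own module and import it; after the first run, Python loads the cached .pyc instead of re-parsing the source.

    # graphdata.py -- generated once, contains nothing but the literal
    db = {"a": [1, 2], "b": [2, 3]}

    # main.py -- the first import compiles graphdata.py to a cached
    # .pyc, so later startups skip the parsing step entirely
    from graphdata import db
    print(db["a"])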

Stefano Driussi
A: 

XML is really oriented to tree structures and is very verbose. You can look at RDF for ways to describe a graph in XML but it still has other disadvantages, e.g. the time to read, parse, and instantiate 500k+ objects and the amount of file space used.

SQL is really oriented to describing rows in tables. You can store graphs of course, but you'll see a performance penalty here too.

I would try Python pickling first to see if it meets your needs. It will probably be the most compact and the fastest to read in and instantiate all the objects.

Really the only reason to use other formats is if you need something they offer, e.g. transactions in SQL or cross-language processing of XML.

Van Gale
A: 

The Python file approach will surely be the fastest if you have a way to maintain the file.

Vasil
+4  A: 

The Python source technique absolutely rules.

XML is slow to parse, and relatively hard for people to read. That's why companies like Altova are in business -- XML isn't pleasant to edit.

Python source db = {"a": [1, 2], "b": ...} is

  1. Fast to parse.

  2. Easy for people to read.

If you have programs that read and write giant dictionaries, use pprint for the writing so that you get nicely formatted, easier-to-read output.
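A minimal sketch of that writing step (the file name is arbitrary):

    import pprint

    db = {"a": [1, 2], "b": [2, 3]}

    # pprint.pformat produces a valid, nicely indented Python
    # literal that can later be imported as source.
    with open("graphdata.py", "w") as f:
        f.write("db = " + pprint.pformat(db) + "\n")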

If you're worried about portability, consider YAML (or JSON) for serializing the object. They also parse quickly, and are much, much easier to read than XML.
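For example, with the standard-library json module (YAML would need a third-party parser such as PyYAML):

    import json

    db = {"a": [1, 2], "b": [2, 3]}

    # Serialize to a portable, human-readable format ...
    with open("graph.json", "w") as f:
        json.dump(db, f)

    # ... and read it back from any language with a JSON parser.
    with open("graph.json") as f:
        db = json.load(f)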

S.Lott