ansaurus

Question

Saving huge bigram dictionary to file using pickle

Answer 1

+7 A:

Pickle is only meant to write complete (small) objects. Your dictionary is a bit large to even hold in memory, you'd better use a database instead so you can store and retrieve entries one by one instead of all at once.

Some good and easily integratable singe-file database formats you can use from Python are SQLite or one of the BDM variantes. The last one acts just like a dictionary (i.e. you can read and write key/value-pairs) but uses the disk as storage rather than 1.2 GBs of memory.

Wim 2010-01-21 10:11:50

Sqlite is a fully relational database, while Berkeley DB is not, just key/value. If it's just storing, I think Berkeley is a better option, while if you want to make some queries and store the information in more organized way, sqlite it's more appropiate.

Khelben 2010-01-21 10:37:22

BerkeleyDB is rather fickle and difficult to manage, especially with larger amounts of data. Even for a single string->string store (which is what BerkeleyDB would be) I would use SQLite, which will take care of all the BerkeleyDB management.

Thomas Wouters 2010-01-21 11:33:03

SQLite does not act like a dictionary.

Thomas Wouters 2010-01-21 11:33:42

The Python page for the bsddb moddule (http://www.python.org/doc/2.6/library/bsddb.html) says that it is deprecated. Is there another non-deprecated Python option for a BSD DB?

Jeff 2010-01-21 13:07:56

http://www.python.org/doc/2.6/library/persistence.html lists a number of data persistence modules. The `gdbm` module looks very similar and still supported, I'd go for that one.

Wim 2010-01-21 13:25:16

your database suggestion was adequate. although he had to use mysql because SQLite just wasn't cutting it.

João Portela 2010-01-25 17:45:40

Answer 2

A:

Do you really need the whole data in memory? You could split it in naive ways like one file for each year o each month if you want the dictionary/pickle approach.

Also, remember that the dictionaries are not sorted, you can have problems having to sort that ammount of data. In case you want to search or sort the data, of course...

Anyway, I think that the database approach commented before is the most flexible one, specially on the long run...

Khelben 2010-01-21 10:41:50

Answer 3

A:

If your really, really want to use a dictionary like semantics, try SQLAlchemy's associationproxy. The following (rather long) piece of code translates your dictionary into Key,Value-Pairs in the entries-Table. I do not know how SQLAlchemy copes with your big dictionary, but SQLite should be able to handle it nicely.

from sqlalchemy import create_engine, MetaData
from sqlalchemy import Table, Column, Integer, ForeignKey, Unicode, UnicodeText
from sqlalchemy.orm import mapper, sessionmaker, scoped_session, Query, relation
from sqlalchemy.orm.collections import column_mapped_collection
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.schema import UniqueConstraint

engine = create_engine('sqlite:///newspapers.db')

metadata = MetaData()
metadata.bind = engine

Session = scoped_session(sessionmaker(engine))
session = Session()

newspapers = Table('newspapers', metadata,
    Column('newspaper_id', Integer, primary_key=True),
    Column('newspaper_name', Unicode(128)),
)

entries = Table('entries', metadata,
    Column('entry_id', Integer, primary_key=True),
    Column('newspaper_id', Integer, ForeignKey('newspapers.newspaper_id')),
    Column('entry_key', Unicode(255)),
    Column('entry_value', UnicodeText),
    UniqueConstraint('entry_key', 'entry_value', name="pair"),
)

class Base(object):

    def __init__(self, **kw):
        for key, value in kw.items():
            setattr(self, key, value)

    query = Session.query_property(Query)

def create_entry(key, value):
    return Entry(entry_key=key, entry_value=value)

class Newspaper(Base):

    entries = association_proxy('entry_dict', 'entry_value',
        creator=create_entry)

class Entry(Base):
    pass

mapper(Newspaper, newspapers, properties={
    'entry_dict': relation(Entry,
        collection_class=column_mapped_collection(entries.c.entry_key)),
})
mapper(Entry, entries)

metadata.create_all()

dictionary = {
    u'foo': u'bar',
    u'baz': u'quux'
}

roll = Newspaper(newspaper_name=u"The Toilet Roll")
session.add(roll)
session.flush()

roll.entries = dictionary
session.flush()

for entry in Entry.query.all():
    print entry.entry_key, entry.entry_value
session.commit()

session.expire_all()

print Newspaper.query.filter_by(newspaper_id=1).one().entries

gives

foo bar
baz quux
{u'foo': u'bar', u'baz': u'quux'}

pi 2010-01-21 14:20:12

Answer 4

+1 A:

One solution is to use buzhug instead of pickle. It's a pure Python solution, and retains very Pythonic syntax. I think of it as the next step up from shelve and their ilk. It will handle the data sizes you're talking about. Its size limit is 2 GB per field (each field is stored in a separate file).

Ryan Ginstrom 2010-01-21 14:52:56

ansaurus

tags:

views:

answers:

Saving huge bigram dictionary to file using pickle

related questions