Currently I use SQLite (with SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI, with keys like name, version, and summary (sometimes 'description' can be as big as the project documentation).

Writing these entries (from JSON) back to disk (in SQLite format) takes several seconds, and it feels slow.

Writing is done as frequently as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.

Just like apt-get.

Is there a storage library for use with Python that will suit my needs better than SQLite?

+2  A: 

Did you put indices on name and description? Searching on 5000 indexed entries should be essentially instantaneous. Of course ORMs will make your life much harder, as they usually do (even relatively good ones such as SQLAlchemy), but try "raw sqlite" and it absolutely should fly.

Writing just the updated entries (again with real SQL) should also be basically instantaneous -- ideally a single UPDATE statement should do it, but even a thousand should be no real problem; just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on again later).
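
For reference, a minimal sketch of the "raw sqlite" approach with the standard sqlite3 module -- the table and column names are just assumptions based on the keys mentioned in the question:

    import json
    import sqlite3

    conn = sqlite3.connect("pypi.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS packages (
            name TEXT,
            version TEXT,
            summary TEXT,
            description TEXT
        )
    """)
    # A unique index on name makes key lookups essentially instantaneous and
    # lets INSERT OR REPLACE update existing rows instead of duplicating them.
    conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_name ON packages (name)")

    def write_entries(entries):
        rows = [
            (e["name"], e.get("version"), e.get("summary"), e.get("description"))
            for e in entries
        ]
        with conn:  # one transaction: a single commit for all rows
            conn.executemany(
                "INSERT OR REPLACE INTO packages (name, version, summary, description) "
                "VALUES (?, ?, ?, ?)",
                rows,
            )

    with open("pypi_index.json") as f:  # hypothetical daily JSON dump
        write_entries(json.load(f))

The single transaction is what matters most for write speed: it avoids a disk sync per row, which is usually where the seconds go.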

Alex Martelli
Actually, it is writing those 5000 entries that is slow. And I essentially have to write them all, say, every day. Think of "apt-get update" with an updated package index ready to be downloaded.
Sridhar Ratnakumar
@srid, can you give the detailed CREATE TABLE statements? Looks like that must be peculiar indeed -- in my experiments it takes orders of magnitude longer to get info from pypi via xmlrpclib than it does to update it in sqlite (of course you only update in SQLite what's changed: why ever should you rewrite UNchanged entries?!).
Alex Martelli
warren
@warren, excellent point -- autocommit _may_ well imply substantial slowdown (I normally disable it first-thing for other reasons anyway, but it's true that performance is one more reason to keep it disabled;-).
Alex Martelli
+1  A: 

It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of couchdb. Basically, rather than storing records as rows in a table, something like couchdb stores key-value pairs, and then (in the case of couchdb) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that will fit into a relational database. On the other hand, it means that there is another service running.
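
As a rough illustration only (the database name and server URL below are assumptions, and it presumes the couchdb-python client package plus a local CouchDB server), storing those same dicts might look something like this:

    import couchdb  # couchdb-python client package

    server = couchdb.Server("http://localhost:5984/")
    db = server.create("pypi")  # assumes the database does not exist yet

    # One document per package, keyed by the package name (values are made up).
    entry = {
        "_id": "somepackage",
        "version": "1.0",
        "summary": "An example entry",
        "description": "Arbitrarily long text fits here without a schema.",
    }
    db.save(entry)           # store the document (updating it later needs its current _rev)

    doc = db["somepackage"]  # fetch it back by key
    print(doc["summary"])

Searching by summary or description would then go through views written in JavaScript and stored in the database, rather than SQL queries.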

Alex
Yes, couchdb is overkill for an essentially standalone command-line application. :-)
Sridhar Ratnakumar
A: 

Given the approximate number of objects stated (around 5,000), SQLite is probably not the source of the slowness. It's the intermediate steps -- for example the JSON handling, or possibly non-optimal use of SQLAlchemy.

Try this out (fairly fast even for a million objects): the y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

The yserial search on your keys is done with regular-expression ("regex") code on the SQLite side, not in Python, so there's another substantial speed improvement.
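
A rough sketch of what usage looks like, going by the project's documentation -- the module filename, table name, and notes string here are assumptions, so check the linked page for the exact API:

    import y_serial_v060 as y_serial  # module filename depends on the version downloaded

    db = y_serial.Main("/tmp/pypi.sqlite")  # SQLite file to warehouse objects in

    entry = {"name": "somepackage", "version": "1.0", "summary": "An example entry"}
    # The notes string is what later gets searched by regex.
    db.insert(entry, "#pypi somepackage example entry", "packages")

    # Retrieve the most recent object whose notes match the pattern.
    result = db.select("somepackage", "packages")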

Let us know how it works out.

code43