I'm new to Python and SQLite, so I'm sure there's a better way to do this. I have a DB with 6000 rows, where one column is a 14K XML string. I wanted to compress all those XML strings to make the DB smaller. Unfortunately, the script below is much, much slower than this simple command line (which takes a few seconds):

sqlite3 weather.db .dump | gzip -c > backup.gz

I know it's not the same thing, but it does read and convert the DB to text and run gzip. So I was hoping this script would be within 10X of that performance, but it's more like 1000X slower. Is there a way to make the following script more efficient? Thanks.

import zlib, sqlite3

conn = sqlite3.connect(r"weather.db")
r = conn.cursor()
w = conn.cursor()
rows = r.execute("select date,location,xml_data from forecasts")
for row in rows:
    data = zlib.compress(row[2])
    w.execute("update forecasts set xml_data=? where date=? and location=?", (data, row[0], row[1]))

conn.commit()
conn.close()
+2  A: 

Not sure you can increase the performance much by doing the update after the fact. There's too much overhead between doing the compression and updating the record, and you won't gain any space savings unless you run VACUUM after you're done with the updates. The best solution would probably be to compress when the records are first inserted; then you get the space savings, and the performance hit won't be as noticeable. If you can't do it on insert, then I think you've explored the two possibilities and seen the results.
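
For illustration, a minimal sketch of compress-on-insert, assuming the forecasts table from the question and that xml_data can hold a BLOB (the sample date/location values are made up):

import zlib, sqlite3

conn = sqlite3.connect("weather.db")

def insert_forecast(date, location, xml_text):
    # Compress once at insert time and store the result as a BLOB.
    blob = sqlite3.Binary(zlib.compress(xml_text.encode("utf-8")))
    conn.execute(
        "insert into forecasts (date, location, xml_data) values (?, ?, ?)",
        (date, location, blob),
    )

# Hypothetical example values -- substitute whatever the real feed provides.
insert_forecast("2010-01-01", "Seattle", "<forecast>...</forecast>")
conn.commit()
conn.close()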

Don Dickinson
I'm asking this question to learn how to write better DB conversion scripts whenever I change the schema. The performance was terrible with just 6K rows; what will I do when I need to update 60K rows to a new schema? I've already run VACUUM and updated my script to compress on INSERT. It reduced space by 10X.
projectshave
As mentioned above (by Larry), wrapping the updates inside a transaction will definitely help. You'll have to experiment with that: perhaps begin a transaction, update 10,000 records, then commit, and repeat every 10K records. Compare that against committing every 1,000 records, etc., until you find the best performance.
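
A minimal sketch of that kind of batching, applied to the script in the question (the batch size is just a starting point to experiment with, and this assumes xml_data is currently stored as text):

import zlib, sqlite3

BATCH_SIZE = 10000  # experiment: try 1000, 10000, etc.

conn = sqlite3.connect("weather.db")
rows = conn.execute("select date, location, xml_data from forecasts").fetchall()

w = conn.cursor()
for i, (date, location, xml_data) in enumerate(rows, 1):
    compressed = zlib.compress(xml_data.encode("utf-8"))
    w.execute("update forecasts set xml_data=? where date=? and location=?",
              (sqlite3.Binary(compressed), date, location))
    if i % BATCH_SIZE == 0:
        conn.commit()  # commit once per batch instead of per row

conn.commit()  # commit the final partial batch
conn.close()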
Don Dickinson
+2  A: 

You are comparing apples to oranges here. The big difference between the sqlite3|gzip version and the Python version is that the latter writes the changes back to the DB!

What sqlite3|gzip does is:

  • read the DB
  • gzip the text

In addition to the above, the Python version writes the gzipped text back into the DB, with one UPDATE per record read.
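
For comparison, the closest read-only equivalent in Python looks roughly like this (a sketch using iterdump; backup.gz is just a placeholder filename):

import gzip, sqlite3

conn = sqlite3.connect("weather.db")

# Dump the whole DB as SQL text and gzip it -- this only reads, it never writes to the DB.
with gzip.open("backup.gz", "wt") as out:
    for line in conn.iterdump():
        out.write(line + "\n")

conn.close()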

Almir Karic
+1  A: 

Sorry, but are you explicitly starting a transaction in your code? If you're autocommitting after each UPDATE, that will slow you down substantially.

Do you have an appropriate index on date and/or location? What kind of variation do you have in those columns? Can you use an autonumbered integer primary key in this table?

Finally, can you profile how much time you're spending in the zlib calls and how much in the UPDATEs? In addition to the database writes that slow this process down, your database version involves 6000 separate calls (with 6000 initializations) of the compression algorithm.
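
A rough way to get that split is to accumulate wall-clock time separately for the compression calls and the UPDATEs (a sketch, again assuming xml_data is stored as text):

import time, zlib, sqlite3

conn = sqlite3.connect("weather.db")
rows = conn.execute("select date, location, xml_data from forecasts").fetchall()

zlib_seconds = 0.0
update_seconds = 0.0
w = conn.cursor()

for date, location, xml_data in rows:
    t0 = time.perf_counter()
    compressed = zlib.compress(xml_data.encode("utf-8"))
    zlib_seconds += time.perf_counter() - t0

    t0 = time.perf_counter()
    w.execute("update forecasts set xml_data=? where date=? and location=?",
              (sqlite3.Binary(compressed), date, location))
    update_seconds += time.perf_counter() - t0

conn.commit()
conn.close()

print("zlib: %.2fs  updates: %.2fs" % (zlib_seconds, update_seconds))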

Larry Lustig
Thanks for the info. RE: transactions, I put a commit after the 6K updates, but maybe it's autocommitting after each UPDATE anyway. I'll check that. RE: indexes, I don't have any set up (I should add one). RE: zlib, I don't see a way to reuse the zlib compression object. I'll check that.
projectshave