ansaurus

Question

Do I need multiple cursor objects to loop over a recordset and update at the same time?

Answer 1

+1 A:

Cursors are bad bad bad for a multitude of reasons.

I'd suggest (and a lot of others will definitely chime in) that you use a single UPDATE statement instead of going the CURSOR route.

Can your Processed_Content be sent as a parameter to a single query that does set based operations like so:

UPDATE data_table
SET processed_content = ?
WHERE processed_content IS NULL
LIMIT 1000

Edited based on responses:

Since every row has a unique value for Processed_Content, you have no option but to use a recordset and a loop. I have done this in the past on multiple occasions. What you are suggesting should work effectively.

Raj More 2009-09-22 20:55:13

Unfortunately no, as that would set every record in the table to the same thing. Each row/record of unprocessed_content is different (although there is no UNIQUE constraint on the table).

tgray 2009-09-22 21:03:31

I'm chiming in ;-) Yes, many tasks can be expressed a simple or relatively simple SQL query, reserve cursors for other cases. With big databases, you may need to "chunk" the updated of say 5 million records over 50 smaller jobs, but this would still be less painful and certainly faster than cursor-ing around.

mjv 2009-09-22 21:04:23

Would very much agree. If you can avoid a cursor based approach, then I'd definitely go for it. Most cursor based SQL can be re-written somehow in a set based fashion.

Paddy 2009-09-22 21:04:45

@tgray: can this different value for the process_content column be phrased as a SQL expression ?

mjv 2009-09-22 21:06:26

@mjv, I don't think so. unprocessed_content is HTML, processed_content is the list of human readable words in the HTML. Also, my database is currently around 12GB in size and expected to grow.

tgray 2009-09-22 21:10:44

@Paddy, what do you mean by "set based fashion"?

tgray 2009-09-22 21:17:55

I believe I need to use the Cursor() object in order to retrieve information from a database using python. I haven't seen an example that doesn't use it. http://www.devshed.com/c/a/Python/Using-SQLite-in-Python/3/

tgray 2009-09-22 21:22:50

The Connection() object has the 'execute' methods, but not the 'fetch' methods.

tgray 2009-09-22 21:23:45

Here's a link to the python documentation as well: http://docs.python.org/library/sqlite3.html

tgray 2009-09-22 21:24:56

Answer 2

+1 A:

I think you have roughly the right architecture -- presenting it in terms of "cursors" WILL confuse the "old SQL hands", because they'll be thinking of the many issues connected with DECLARE foo CURSOR, FETCH FROM CURSOR, WHERE CURRENT OF CURSOR, and other such beauts having to do with SQL cursors. Python DB API's "cursor" is simply a convenient way to package and execute SQL statements, not necessarily connected with SQL cursors -- it won't suffer from any of those problems -- though it may present its (completely original) own ones;-) But, with the "batching" of results you're doing, your proper commits, etc, you have preventively finessed most of those "original problems" I had in mind.

On some other engines I'd suggest doing first a select into a temporary table, then reading from that temporary table while updating the primary one, but I'm uncertain how the performance would be affected in sqlite, depending on what indices you have (if no index is affected by your update, then I suspect that such a temporary table would not be an optimization at all in sqlite -- but I can't run benchmarks on your data, the only real way to check performance hypotheses).

So, I'd say, go for it!-)

Alex Martelli 2009-09-23 04:16:26

I tried creating a second connection to the database with it's own cursor object, but got this, "sqlite3.OperationalError: database is locked". Can this be gotten around just by creating the temporary table?

tgray 2009-09-23 14:45:31

The reason I tried the connection/cursor first is because the temporary table will be close to 12GB worth of data (effectively doubling the size of my database while it runs).

tgray 2009-09-23 14:48:18

Argh, you're right -- sqlite's lock is db-wide!-( The temp table will just help by doing the join once and for all, but then you'll have to use the select/limit (updating both temp and real table, or updating the real one and deleting from the temp by the same increment) -- doesn't sound fast, but I don't see alternatives.

Alex Martelli 2009-09-23 15:05:39

Maybe a second temporary database file to store the processed_content with an id, then move the data over once it's finished? I don't know how hard it is to transfer data between databases.

tgray 2009-09-23 16:00:38

At least if your DB isn't open, a single call to `shutil.copy` will copy the DB file, then you can read from the copy while updating the original and finally `os.unlink` the copy.

Alex Martelli 2009-09-24 01:38:44

Does that work in Windows?

tgray 2009-09-24 15:10:41

I don't think I need a full copy of the database. The processed_content is much smaller than the unprocessed_content, so writing to a new file while reading from the old, then transferring the processed content back will use less disk space.

tgray 2009-09-24 15:11:28

@tgray, yes, shutil.copy works fine in Windows. But given the size ratio you mention, you're probably right about using a temporary file (if `processed_content`, despite being "much smaller", won't fit in memory -- now THAT would be fast!-) is likely a winner.

Alex Martelli 2009-09-25 01:07:38

Answer 3

+1 A:

Is it possible to create a DB function that will process your content? If so, you should be able to write a single update statement and let the database do all the work. Eg;

Update data_table
set processed_col = Process_Column(col_to_be_processed)

Rap 2009-09-24 19:14:56

Unfortunately, my Process_Column function parses HTML and I don't know any good (easy) way of doing that in SQL. Do you? Good answer if my processing was a little less complicated.

tgray 2009-09-25 12:24:44

ansaurus

tags:

views:

answers:

Do I need multiple cursor objects to loop over a recordset and update at the same time?

related questions