So I've got a large database that I can't hold in memory at once. I've got to loop over every item in a table, process it, and put the processed data into another column in the table.

While I'm looping over my cursor, if I try to run an update statement it truncates the recordset (I believe because it's re-purposing the cursor object).

Questions:

Will creating a second cursor object to run the update statements allow me to continue looping over the original select statement?

Do I need a second connection to the database in order to have a second cursor object that will allow me to do this?

How would sqlite respond to having two connections to the database, one reading from the table, the other writing to it?

My code (simplified):

import sqlite3

class DataManager():
    """ Manages database (used below). 
        I cut this class way down to avoid confusion in the question.
    """
    def __init__(self, db_path):
        self.connection = sqlite3.connect(db_path)
        self.connection.text_factory = str
        self.cursor = self.connection.cursor()

    def genRecordset(self, str_sql, subs=tuple()):
        """ Generate records as tuples, for str_sql.
        """
        self.cursor.execute(str_sql, subs)
        for row in self.cursor:
            yield row

select = """
            SELECT id, unprocessed_content 
            FROM data_table 
            WHERE processed_content IS NULL
         """

update = """
            UPDATE data_table
            SET processed_content = ?
            WHERE id = ?
         """
data_manager = DataManager(r'C:\myDatabase.db')
subs = []
for row in data_manager.genRecordset(select):
    id, unprocessed_content = row
    processed_content = processContent(unprocessed_content)
    subs.append((processed_content, id))

    #every n records update the database (whenever I run out of memory)
    if len(subs) >= 1000:
        data_manager.cursor.executemany(update, subs)
        data_manager.connection.commit()
        subs = []
#update remaining records
if subs:
    data_manager.cursor.executemany(update, subs)
    data_manager.connection.commit()

The other method I tried was to modify my select statement to be:

select = """
            SELECT id, unprocessed_content 
            FROM data_table 
            WHERE processed_content IS NULL
            LIMIT 1000
         """

Then I would do:

recordset = data_manager.cursor.execute(select).fetchall()
while recordset:
    #do update stuff...
    recordset = data_manager.cursor.execute(select).fetchall()

The problem I had with this was that my real select statement has a JOIN in it and takes a while, so executing the JOIN that many times is very time intensive. I'm trying to speed up the process by only doing the select once, then using a generator so I don't have to hold it all in memory.
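For reference, the "select once, stream in chunks" idea can also be written with fetchmany; this is only a sketch of the streaming side (the names are illustrative) and does not by itself avoid the problem described above:

def gen_recordset(cursor, str_sql, subs=tuple(), chunk_size=1000):
    """ Yield rows for str_sql in fixed-size chunks via fetchmany,
        so the full result set never has to sit in memory at once.
    """
    cursor.execute(str_sql, subs)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row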

Solution:

Ok, so the answer to my first two questions is "No." To my third question, once a connection is made to a database, it locks the entire database, so another connection won't be able to do anything until the first connection is closed.

I couldn't find the source code for it, but from empirical evidence I believe that a connection can only use one cursor object at a time and the last run query takes precedence. This means that, while I'm looping over the selected recordset yielding one row at a time, as soon as I run my first update statement my generator stops yielding rows.

My solution is to create a temporary database that I stick the processed_content in with the id, so that I have one connection/cursor object per database and can continue looping over the selected recordset, while inserting into the temporary database periodically. Once I reach the end of my selected recordset I transfer the data in the temporary database back to the original.
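For concreteness, here is a minimal sketch of that approach; the scratch-file path, the scratch table name, and processContent (from the snippet above) are assumptions:

import sqlite3

main = sqlite3.connect(r'C:\myDatabase.db')
main.text_factory = str
temp = sqlite3.connect(r'C:\processed_temp.db')
temp.execute("CREATE TABLE IF NOT EXISTS processed (id INTEGER PRIMARY KEY, processed_content TEXT)")

select = """
            SELECT id, unprocessed_content
            FROM data_table
            WHERE processed_content IS NULL
         """
batch = []
#one connection streams the big SELECT, the other writes batches to the scratch database
for row_id, unprocessed in main.execute(select):
    batch.append((row_id, processContent(unprocessed)))
    if len(batch) >= 1000:
        temp.executemany("INSERT INTO processed VALUES (?, ?)", batch)
        temp.commit()
        batch = []
if batch:
    temp.executemany("INSERT INTO processed VALUES (?, ?)", batch)
    temp.commit()
temp.close()

#transfer the processed content back into the original table in one pass
main.execute("ATTACH DATABASE ? AS scratch", (r'C:\processed_temp.db',))
main.execute("""UPDATE data_table
                SET processed_content = (SELECT p.processed_content
                                         FROM scratch.processed AS p
                                         WHERE p.id = data_table.id)
                WHERE id IN (SELECT id FROM scratch.processed)""")
main.commit()
main.execute("DETACH DATABASE scratch")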

If anyone knows for sure about the connection/cursor objects, let me know in a comment.

+1  A: 

Cursors are bad bad bad for a multitude of reasons.

I'd suggest (and a lot of others will definitely chime in) that you use a single UPDATE statement instead of going the CURSOR route.

Can your Processed_Content be sent as a parameter to a single query that does set-based operations, like so:

UPDATE data_table
SET processed_content = ?
WHERE processed_content IS NULL
LIMIT 1000

Edited based on responses:

Since every row has a unique value for Processed_Content, you have no option but to use a recordset and a loop. I have done this in the past on multiple occasions. What you are suggesting should work effectively.

Raj More
Unfortunately no, as that would set every record in the table to the same thing. Each row/record of unprocessed_content is different (although there is no UNIQUE constraint on the table).
tgray
I'm chiming in ;-) Yes, many tasks can be expressed as a simple or relatively simple SQL query; reserve cursors for other cases. With big databases, you may need to "chunk" the update of, say, 5 million records over 50 smaller jobs, but this would still be less painful and certainly faster than cursor-ing around.
mjv
Would very much agree. If you can avoid a cursor based approach, then I'd definitely go for it. Most cursor based SQL can be re-written somehow in a set based fashion.
Paddy
@tgray: can this different value for the processed_content column be phrased as a SQL expression?
mjv
@mjv, I don't think so. unprocessed_content is HTML, processed_content is the list of human readable words in the HTML. Also, my database is currently around 12GB in size and expected to grow.
tgray
@Paddy, what do you mean by "set based fashion"?
tgray
I believe I need to use the Cursor() object in order to retrieve information from a database using Python. I haven't seen an example that doesn't use it. http://www.devshed.com/c/a/Python/Using-SQLite-in-Python/3/
tgray
The Connection() object has the 'execute' methods, but not the 'fetch' methods.
tgray
Here's a link to the python documentation as well: http://docs.python.org/library/sqlite3.html
tgray
+1  A: 

I think you have roughly the right architecture -- presenting it in terms of "cursors" WILL confuse the "old SQL hands", because they'll be thinking of the many issues connected with DECLARE foo CURSOR, FETCH FROM CURSOR, WHERE CURRENT OF CURSOR, and other such beauts having to do with SQL cursors. Python DB API's "cursor" is simply a convenient way to package and execute SQL statements, not necessarily connected with SQL cursors -- it won't suffer from any of those problems -- though it may present its own (completely original) ones;-) But, with the "batching" of results you're doing, your proper commits, etc, you have preventively finessed most of those "original problems" I had in mind.

On some other engines I'd suggest first doing a select into a temporary table, then reading from that temporary table while updating the primary one, but I'm uncertain how the performance would be affected in sqlite, depending on what indices you have (if no index is affected by your update, then I suspect such a temporary table would not be an optimization at all in sqlite -- but I can't run benchmarks on your data, which is the only real way to check performance hypotheses).
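For illustration, a rough sketch of that temp-table idea, using the simplified schema from the question (the real, expensive JOIN would go in the CREATE TABLE ... AS SELECT, which runs only once; processContent is the poster's function):

import sqlite3

conn = sqlite3.connect(r'C:\myDatabase.db')
#materialize the work list once, so the expensive JOIN runs only one time
conn.execute("""CREATE TEMPORARY TABLE todo AS
                SELECT id, unprocessed_content
                FROM data_table
                WHERE processed_content IS NULL""")

while True:
    #chew through the work list in LIMITed batches
    rows = conn.execute("SELECT id, unprocessed_content FROM todo LIMIT 1000").fetchall()
    if not rows:
        break
    subs = [(processContent(content), row_id) for row_id, content in rows]
    conn.executemany("UPDATE data_table SET processed_content = ? WHERE id = ?", subs)
    conn.executemany("DELETE FROM todo WHERE id = ?", [(row_id,) for row_id, _ in rows])
    conn.commit()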

So, I'd say, go for it!-)

Alex Martelli
I tried creating a second connection to the database with its own cursor object, but got this: "sqlite3.OperationalError: database is locked". Can this be gotten around just by creating the temporary table?
tgray
The reason I tried the connection/cursor first is because the temporary table will be close to 12GB worth of data (effectively doubling the size of my database while it runs).
tgray
Argh, you're right -- sqlite's lock is db-wide!-( The temp table will just help by doing the join once and for all, but then you'll have to use the select/limit (updating both temp and real table, or updating the real one and deleting from the temp by the same increment) -- doesn't sound fast, but I don't see alternatives.
Alex Martelli
Maybe a second temporary database file to store the processed_content with an id, then move the data over once it's finished? I don't know how hard it is to transfer data between databases.
tgray
At least if your DB isn't open, a single call to `shutil.copy` will copy the DB file, then you can read from the copy while updating the original and finally `os.unlink` the copy.
Alex Martelli
Does that work in Windows?
tgray
I don't think I need a full copy of the database. The processed_content is much smaller than the unprocessed_content, so writing to a new file while reading from the old, then transferring the processed content back will use less disk space.
tgray
@tgray, yes, shutil.copy works fine in Windows. But given the size ratio you mention, you're probably right: a temporary file (if `processed_content`, despite being "much smaller", won't fit in memory -- now THAT would be fast!-) is likely the winner.
Alex Martelli
+1  A: 

Is it possible to create a DB function that will process your content? If so, you should be able to write a single update statement and let the database do all the work. E.g.:

UPDATE data_table
SET processed_col = Process_Column(col_to_be_processed)
Rap
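For what it's worth, Python's sqlite3 module can register an ordinary Python function as a SQL function via Connection.create_function, so the processing itself would not have to be written in SQL. A minimal sketch, reusing the question's table and a stand-in for the real processing:

import sqlite3

def process_column(html):
    #stand-in for the real HTML-to-words processing
    return html

conn = sqlite3.connect(r'C:\myDatabase.db')
#register the Python function so SQL can call it by name (1 = number of arguments)
conn.create_function("Process_Column", 1, process_column)
conn.execute("""UPDATE data_table
                SET processed_content = Process_Column(unprocessed_content)
                WHERE processed_content IS NULL""")
conn.commit()
conn.close()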
Unfortunately, my Process_Column function parses HTML and I don't know any good (easy) way of doing that in SQL. Do you? Good answer if my processing was a little less complicated.
tgray