So I've got a large database that I can't hold in memory at once. I've got to loop over every item in a table, process it, and put the processed data into another column in the table.
While I'm looping over my cursor, running an update statement truncates the recordset (I believe because the new query re-purposes the cursor object).
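Here's a minimal sketch of the behavior I'm describing, using a throwaway in-memory database (the table and values are just placeholders):

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO t (val) VALUES (?)", [('x',)] * 10)
conn.commit()

cur.execute("SELECT id FROM t")
rows_seen = 0
for (row_id,) in cur:
    rows_seen += 1
    # running another statement on the same cursor replaces its result set
    cur.execute("UPDATE t SET val = 'y' WHERE id = ?", (row_id,))
print(rows_seen)  # 1, not 10 -- the loop stops after the first row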
Questions:
Will creating a second cursor object to run the update statements allow me to continue looping over the original select statement?
Do I need a second connection to the database in order to have a second cursor object that will allow me to do this?
How would sqlite respond to having two connections to the database, one reading from the table, the other writing to it?
My code (simplified):
import sqlite3

class DataManager():
    """ Manages database (used below).

    I cut this class way down to avoid confusion in the question.
    """
    def __init__(self, db_path):
        self.connection = sqlite3.connect(db_path)
        self.connection.text_factory = str
        self.cursor = self.connection.cursor()

    def genRecordset(self, str_sql, subs=tuple()):
        """ Generate records as tuples, for str_sql.
        """
        self.cursor.execute(str_sql, subs)
        for row in self.cursor:
            yield row
select = """
SELECT id, unprocessed_content
FROM data_table
WHERE processed_content IS NULL
"""
update = """
UPDATE data_table
SET processed_content = ?
WHERE id = ?
"""
data_manager = DataManager(r'C:\myDatabase.db')

subs = []
for row in data_manager.genRecordset(select):
    row_id, unprocessed_content = row
    processed_content = processContent(unprocessed_content)
    subs.append((processed_content, row_id))
    # every n records, update the database (whenever I run out of memory)
    if len(subs) >= 1000:
        data_manager.cursor.executemany(update, subs)
        data_manager.connection.commit()
        subs = []

# update the remaining records
if subs:
    data_manager.cursor.executemany(update, subs)
    data_manager.connection.commit()
The other method I tried was to modify my select statement to be:
select = """
SELECT id, unprocessed_content
FROM data_table
WHERE processed_content IS NULL
LIMIT 1000
"""
Then I would do:
recordset = data_manager.cursor.execute(select).fetchall()
while recordset:
    # do update stuff...
    recordset = data_manager.cursor.execute(select).fetchall()
The problem I had with this was that my real select statement has a JOIN in it and takes a while, so executing that JOIN over and over again is very time-intensive. I'm trying to speed the process up by doing the select only once and then using a generator, so I don't have to hold the whole recordset in memory.
Solution:
Ok, so the answer to my first two questions is "No." As for my third question: from what I can tell, sqlite locks the whole database file rather than individual tables or rows, so while my first connection is in the middle of its SELECT, a second connection's writes fail with "database is locked" until the first one finishes (or is closed).
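For example, this is the kind of failure I saw with two connections (a rough sketch; the exact behavior may depend on the sqlite3 version and journal mode, and the timeout is just there to fail fast):

import sqlite3

reader = sqlite3.connect(r'C:\myDatabase.db')
writer = sqlite3.connect(r'C:\myDatabase.db', timeout=0.1)

read_cursor = reader.cursor()
read_cursor.execute("SELECT id FROM data_table")
read_cursor.fetchone()  # the SELECT is now in progress, holding a read lock

try:
    # in autocommit mode this tries to commit immediately, which needs an
    # exclusive lock and fails while the read is still in progress
    writer.execute("UPDATE data_table SET processed_content = 'x' WHERE id = 1")
except sqlite3.OperationalError as e:
    print(e)  # database is locked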
I couldn't find the source code for it, but from empirical evidence I believe a connection can only use one cursor object at a time, and the most recently run query takes precedence. This means that while I'm looping over the selected recordset, yielding one row at a time, my generator stops yielding rows as soon as I run my first update statement.
My solution is to create a temporary database that I stick the processed_content in, along with the id, so that I have one connection/cursor pair per database. That lets me keep looping over the selected recordset while periodically inserting into the temporary database; once I reach the end of the recordset, I transfer the data from the temporary database back to the original.
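For reference, here's roughly what that looks like (a sketch, not my exact code: the scratch-file path is a placeholder, processContent is my processing function, and select/update are the statements defined above):

import sqlite3

main = sqlite3.connect(r'C:\myDatabase.db')
main.text_factory = str

scratch = sqlite3.connect(r'C:\myDatabase_temp.db')  # placeholder path
scratch.execute("""
    CREATE TABLE IF NOT EXISTS processed (
        id INTEGER PRIMARY KEY,
        processed_content TEXT
    )
""")

read_cursor = main.cursor()
read_cursor.execute(select)

batch = []
for row_id, unprocessed_content in read_cursor:
    batch.append((row_id, processContent(unprocessed_content)))
    if len(batch) >= 1000:
        # writing to a *different* database file leaves the SELECT on the
        # main connection undisturbed
        scratch.executemany("INSERT OR REPLACE INTO processed VALUES (?, ?)", batch)
        scratch.commit()
        batch = []
if batch:
    scratch.executemany("INSERT OR REPLACE INTO processed VALUES (?, ?)", batch)
    scratch.commit()

# the SELECT is exhausted now, so the main database is free to accept writes
main.executemany(update, (
    (content, row_id)
    for row_id, content in scratch.execute("SELECT id, processed_content FROM processed")
))
main.commit()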
If anyone knows for sure about the connection/cursor objects, let me know in a comment.