Ok my giant friends, once again I seek a little space on your shoulders :P

Here is the issue: I have a Python script that is fixing some database issues, but it is taking way too long. The main update statement is this:

cursor.execute("UPDATE jiveuser SET username = '%s' WHERE userid = %d" % (newName,userId))

That is getting called about 9500 times with different newName and userid pairs...

Any suggestions on how to speed up the process? Is there maybe a way to do all the updates with just one query?

Any help will be much appreciated!

PS: Postgres is the db being used.

+2  A: 

You might want to look into executemany(): Information here
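For instance, a minimal sketch of how that might look here (psycopg2 assumed; `changes` is a hypothetical name for the list of (newName, userId) pairs):

# `cursor` and `conn` are the existing psycopg2 cursor and connection;
# `changes` is assumed to hold the (newName, userId) pairs.
cursor.executemany(
    "UPDATE jiveuser SET username = %s WHERE userid = %s",
    changes,
)
conn.commit()  # one commit for the whole batch

Depending on the driver, executemany may still send one statement per row under the hood, so it's worth measuring whether it actually helps.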

VeeArr
+3  A: 

First of all, do not use the % operator to construct your SQL. Instead, pass your tuple of arguments as the second parameter to cursor.execute, which also negates the need to quote your argument and allows you to use %s for everything:

cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s", (newName, userId))

This is important to prevent SQL injection attacks.

To answer your question, you can speed up these updates by creating an index on the userid column, which lets the database find each row with an index lookup (roughly O(log n) for a B-tree) rather than scanning the entire table, which is O(n). Since you're using PostgreSQL, here's the syntax to create your index:

CREATE INDEX username_lookup ON jiveuser (userid);

EDIT: Since your comment reveals that you already have an index on the userid column, there's not much you could possibly do to speed up that query. So your main choices are either living with the slowness, since this sounds like a one-time fix-something-broken thing, or following VeeArr's advice and testing whether cursor.executemany will give you a sufficient boost.

Eli Courtwright
Well I have an index for the userid which is what is in the where clause... would an index on the username really help?
hdx
@hdx: No, that was my mistake, I carelessly typed `username` rather than `userid`, and I didn't realize you already had an index on that column. I have just corrected my typo.
Eli Courtwright
Is `%s` the placeholder per the Postgres DB-API? I don't know, I'm asking, since the placeholder is typically `?`.
ΤΖΩΤΖΙΟΥ
@TZ: `%s` is the DB-API 2.0 standard placeholder; implementations may use placeholders such as `?` as well as `%s`, but my understanding is that `%s` is required to work for an implementation to be considered DB-API 2.0 compliant.
Eli Courtwright
+1  A: 

Perhaps you can create an index on userid to speed things up.

anijhaw
A: 

Move this to a stored procedure and execute it from the database itself.

Jorge Niedbalski R.
Well, by the time you've put the data into the db in a temp table or similar, you don't need a server function.
araqnid
+4  A: 

Insert all the data into another empty table (called userchanges, say), then UPDATE in a single batch:

UPDATE jiveuser
SET username = userchanges.username
FROM userchanges
WHERE userchanges.userid = jiveuser.userid
    AND userchanges.username <> jiveuser.username

See this documentation on the COPY command for bulk loading your data.

There are also tips for improving performance when populating a database.
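A rough sketch of that approach from the Python side (assuming Python 3 and psycopg2; `conn`, `cur`, and `changes` are hypothetical names for your connection, cursor, and list of (newName, userId) pairs, and the usernames are assumed to contain no tabs or newlines):

import io

# Temporary table to hold the pairs; it goes away when the session ends.
cur.execute("CREATE TEMP TABLE userchanges (userid integer, username text)")

# Bulk-load all pairs with a single COPY instead of ~9500 INSERTs.
buf = io.StringIO()
for new_name, user_id in changes:
    buf.write("%d\t%s\n" % (user_id, new_name))
buf.seek(0)
cur.copy_from(buf, "userchanges", columns=("userid", "username"))

# One set-based UPDATE instead of row-by-row statements.
cur.execute("""
    UPDATE jiveuser
    SET username = userchanges.username
    FROM userchanges
    WHERE userchanges.userid = jiveuser.userid
      AND userchanges.username <> jiveuser.username
""")
conn.commit()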

Cade Roux
Make that a temporary table and you don't have to bother about it after you're done.
Frank Heikens
@Cade Roux: I updated your statement a bit as PostgreSQL's UPDATE...FROM construct differs a little from what you wrote (but close enough that I only noticed it on the second pass...)
araqnid
Also, look into using COPY to push the data to the server, although it will need some driver-specific support to do that (usually called putcopydata or similar)
araqnid
So in order to create that temp table I'll need to run the same number of insert statements as I was running update statements before... Are inserts that much faster than updates done based on an indexed field?
hdx
@hdx On SQL Server, we would use a BULK INSERT / SqlBulkCopy to commit 9500 rows in a single insert. The UPDATE should commit 9500 changes in a single commit.
Cade Roux
I'd be interested to see a benchmark on this approach. I suspect that doing 9500 inserts into a new table followed by a bulk update would take longer than 9500 updates. After all, all data would then have to be processed twice.
Jay
@Jay, Like I said, do one insert of 9500 rows and one update of 9500 rows. Fewer large set operations are generally faster on RDBMS than many small set operations.
Cade Roux
@Cade: Oh, oh, you're assuming that your SQL dialect supports insert of multiple records with a single statement. Some don't, so I was thinking you would need multiple insert statements, and at that point it would become a question of the relative performance of a bunch of inserts followed by a mass update versus a bunch of updates.
Jay
@Jay Postgres supports COPY as well as disabling AUTOCOMMIT, but typically an entire batch operation needs a single transaction to ensure integrity. This question does not give the motivation as to why the database issues are needing fixing nor whether this needs to happen on a regular basis. If it didn't have to happen on a regular basis, I would argue that performance doesn't even matter and just "run and done". For production, get to the point (which may indeed take a while) where the database can quickly do the part which needs ACID, in order to reduce effective locking time.
Cade Roux
+1  A: 

I'd run an EXPLAIN on this. If it's doing an index lookup to find the record -- which it should if you have an index on userid -- then I don't see what you could do to improve performance. If it's not using the index, then the trick is figuring out why not and fixing it.
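For example (hypothetical values; psycopg2 assumed), you can look at the plan for one representative update -- EXPLAIN without ANALYZE only plans the statement, it doesn't run it:

cursor.execute("EXPLAIN UPDATE jiveuser SET username = %s WHERE userid = %s",
               ("test", 123))
print("\n".join(row[0] for row in cursor.fetchall()))
# You want to see "Index Scan ... on jiveuser" rather than "Seq Scan on jiveuser".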

Oh, you could try using a prepared statement. With 9500 updates, that should help.

Jay
+3  A: 

The reason it's taking so long is probably that you've got autocommit enabled and each update gets done in its own transaction.

This is slow because even if you have a battery-backed raid controller (which you should definitely have on all database servers, of course), it still needs to do a write into that device for every transaction commit to ensure durability.

The solution is to do more than one row per transaction. But don't make transactions TOO big or you run into problems too. Try committing every 10,000 rows of changes as a rough guess.
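A rough sketch of that batching (psycopg2 assumed, autocommit left off; `conn`, `cursor`, and `changes` are hypothetical names for your connection, cursor, and (newName, userId) pairs):

BATCH_SIZE = 10000  # commit roughly every 10,000 updates

for i, (new_name, user_id) in enumerate(changes, start=1):
    cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s",
                   (new_name, user_id))
    if i % BATCH_SIZE == 0:
        conn.commit()  # flush this batch in one transaction

conn.commit()  # commit whatever is left over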

MarkR
A: 

First, make sure you have an index on 'userid'; this ensures the DBMS doesn't have to do a table scan each time:

CREATE INDEX jiveuser_userid ON jiveuser (userid);

Next, try preparing the statement and then calling EXECUTE on it. This stops the optimizer from having to examine the query each time:

PREPARE update_username(text, integer) AS UPDATE jiveuser SET username = $1 WHERE userid = $2;
EXECUTE update_username('New Name', 123);

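If you're driving this from the Python script rather than psql, roughly the same idea looks like this (psycopg2 assumed; `changes` is a hypothetical name for the (newName, userId) pairs):

cursor.execute("PREPARE update_username(text, integer) AS "
               "UPDATE jiveuser SET username = $1 WHERE userid = $2")
for new_name, user_id in changes:
    cursor.execute("EXECUTE update_username(%s, %s)", (new_name, user_id))
conn.commit()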
Finally, a bit more performance could be squeezed out by turning off autocommit

\set AUTOCOMMIT off
W Devauld