tags:

views:

1062

answers:

3

I need to run a couple of relatively simple SQL update statements to update a single column in an Oracle table with 14.4 million rows. One statement runs a function written in Java and the JVM runs out of memory as I’m doing the update on all 14.4 million rows.

Have you written a kind of batch PL/SQL routine that can break this simple update into sets of, say, 10K records per batch? I know that if I can commit my updates after a bunch of records, it’ll go a lot faster and I won’t run out of memory. I’m sure there’s a simple way to do this using a FOR loop and row_num but I'm not making much progress.

Here are the two statements I need to run for each batch of n records:

 update vr_location l set l.usps_address=(
   select mylib.string_utils.remove_duplicate_whitespace(
   house_number||' '||pre_street_direction||' '||street_name||' '||street_description||' '||post_street_direction)
 from vr_address a where a.address_pk=l.address_pk);

update vr_location set usps_address=mylib.usaddress_utils.parse_address(usps_address);
+1  A: 

Do an initial select to retrieve some kind of grouping attribute, so that you end up with groups that have the desired number of rows. Experiment with the grouping clause, for instance the last three digits of a zip-code or something semi random.

Loop over the grouping clause, using the clause as parameter to limit the rows targeted by each update statement. commit at the end of each iteration.

stili
If we could see the columns in vr_location, we may be able to suggest something.
Gary
The columns in VR_LOCATION do include a zip_co column which is an FK back to a zip_code table. I'm going to take stili's advice and use this to get subsets of data. It's less about trying to get everything done in one set-based statement, and more about getting my JVM to handle 20-30K rows at a time without crashing.
Dylan
A: 

You (or your DBA) should size the UNDO properly and do this as a single SQL transaction

The advantages are:

  • read consistency on table while this is happening
  • you retain the ability to rollback the transaction in case something fails

If you're in some sort of loading environment where you don't care about either of these, then use CTAS (create table as select) to make a new table with the modified value, build the indexes, constraints, etc and then swap the table names. 14 million rows isn't that big these days.

dpbradley
Doesn't sound like an UNDO problem, but with the java stored procedure not releasing memory.
Gary
Yep, it's that the JVM is killing my memory. Working on a laptop with 4GB RAM, half of which is used by oracle, and the remainder is split between OS, Java, etc. But I've mentioned this to my DBA. Also, this is not production: I've got to munge some data for exp/imp to a production table.
Dylan
A: 

Well, I had to get stuff done so I took your recommendations then did a little Python to do it. I ended up using cx_Oracle to give me good control over the transactions. Obviously PL/SQL would have been better but I don't know it. Python is my new hammer, and everything is a nail!

#!/usr/bin/env python
import csv
import time
import cx_Oracle

# Parses USPS addresses from voter addresses
# and inserts them into VR_LOCATION table ready
# for geocoding. Does batches by zipcode
def LoadZips():
    zipcodes = []
    zips = open('OH_ZIP_CODES.txt','r')
    for line in zips:
        zip = line[0:5]
        if zip not in zipcodes:
            zipcodes.append(zip)
    zips.close()
    return zipcodes

def UpdateAddresses(ziplist):
    counter = 1
    total = len(ziplist)

    for zipcode in ziplist:
        orcl = cx_Oracle.connect('voter/voter@oracle')
        curs = orcl.cursor()
        countsql = "select count(*) from vr_location where zip_co = '%s'" % zipcode
        concatsql = """update vr_location l set l.usps_address=(
                    select mizar.string_utils.remove_duplicate_whitespace(
                        house_number
                        ||' '||pre_street_direction
                        ||' '||street_name
                        ||' '||street_description
                        ||' '||post_street_direction)
                    from vr_address a where a.address_pk = l.address_pk)
                where zip_co = '%s'""" % zipcode
        parsesql = """update vr_location set usps_address =  mizar.usaddress_utils.parse_address(usps_address)
                where zip_co = '%s'""" % zipcode
        curs.execute(countsql)

        records_affected = curs.fetchone()[0]
    if records_affected == 0:
        print "No records for zipcode %s" % zipcode
        counter += 1 
        continue

    print "[%s] %s of %s: %s addresses" % (zipcode, counter, total, records_affected)
    curs.execute(concatsql)
    orcl.commit()
    curs.execute(parsesql)
    orcl.commit()
    curs.close()
    counter += 1 

    # Uncomment this to debug - just steps through X zipcodes      
    #if counter == 3:
    #    print "Cleaning up..."
    #    break


if __name__ == "__main__":
    start = time.clock()
    zipcodes = LoadZips()
    print "Processing addresses in %s zip codes" % len(zipcodes)
    UpdateAddresses(zipcodes)

Blog post on the topic

Dylan