I'm looking for a more efficient way to run many column updates on the same table, like this:

UPDATE table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col ~ 'foo';

where foo and bar are drawn from a set of 40 different regex replacements. I doubt even 25% of the dataset needs to be updated at all, but what I want to know is whether it is possible to cleanly achieve the following in SQL:

  • A single pass update
  • A single match of the regex, triggers a single replace
  • Not running all possible regexp_replaces if only one matches
  • Not updating all columns if only one needs the update
  • Not updating a row if no column has changed
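
For illustration, a two-replacement version of the single-pass idea (a sketch with hypothetical patterns, not from the question) might look like:

```sql
UPDATE table
SET col = regexp_replace( regexp_replace( col, 'foo', 'bar' ), 'baz', 'qux' )
WHERE col ~ 'foo' OR col ~ 'baz';
```

The WHERE clause keeps unmatched rows from being rewritten, but note that both replaces still run on every matching row, so this only satisfies some of the goals above.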

I'm also curious, I know in MySQL (bear with me)

UPDATE foo SET bar = 'baz'

has an implicit WHERE bar != 'baz' clause.

However, I know this doesn't exist in PostgreSQL. I think I could answer at least one of my questions if I knew how to skip a single row's update when the target columns weren't changed.

Something like

UPDATE table
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*
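
PostgreSQL has no such temporary variable in an UPDATE, but the same effect can be had, at the cost of evaluating the expression twice, by repeating it in the WHERE clause. A sketch:

```sql
UPDATE table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col IS DISTINCT FROM regexp_replace( col, 'foo', 'bar' );
```

IS DISTINCT FROM also handles NULLs, which a plain != would silently skip.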
+4  A: 

Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.

Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.

Experimental Results

In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:

in a transaction:
  opened up a cursor (select for update)
  while reading 100 rows from the cursor returns any rows:
    for each row:
      for each regular expression:
        do the gsub on the text column
      update the row

This takes 1.16 seconds with a locally connected database.

Then the "big replace," a single mega-regex update:

update foo set t = regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t, E'\bcommit\b', E'COMMIT'), E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b', E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'), E'\bAuthor:\b', E'AUTHOR:'), E'\bCarl\b', E'CARL'), E'\bWorth\b', E'WORTH'), E'\b\b', E''), E'\bDate:\b', E'DATE:'), E'\bMon\b', E'MON'), E'\bOct\b', E'OCT'), E'\b26\b', E'26'), E'\b04:53:13\b', E'04:53:13'), E'\b2009\b', E'2009'), E'\b-0700\b', E'-0700'), E'\bUpdate\b', E'UPDATE'), E'\bversion\b', E'VERSION'), E'\bto\b', E'TO'), E'\b2.9.1\b', E'2.9.1'), E'\bcommit\b', E'COMMIT'), E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b', E'61C89E56F361FA860F18985137D6BF53F48C16AC'), E'\bAuthor:\b', E'AUTHOR:'), E'\bCarl\b', E'CARL'), E'\bWorth\b', E'WORTH'), E'\b\b', E''), E'\bDate:\b', E'DATE:'), E'\bMon\b', E'MON'), E'\bOct\b', E'OCT'), E'\b26\b', E'26'), E'\b04:51:58\b', E'04:51:58'), E'\b2009\b', E'2009'), E'\b-0700\b', E'-0700'), E'\bNEWS:\b', E'NEWS:'), E'\bAdd\b', E'ADD'), E'\bnotes\b', E'NOTES'), E'\bfor\b', E'FOR'), E'\bthe\b', E'THE'), E'\b2.9.1\b', E'2.9.1'), E'\brelease.\b', E'RELEASE.'), E'\bThanks\b', E'THANKS'), E'\bto\b', E'TO'), E'\beveryone\b', E'EVERYONE'), E'\bfor\b', E'FOR')

The mega-regex update takes 0.94 seconds to update.

At 0.94 seconds compared to 1.16, the mega-regex update is indeed faster, running in 81% of the time of doing it in code. It is not, however, a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?

Code

The code used was:

  def stupid_regex_replace
    sql = Select.new
    sql.select('id')
    sql.select('t')
    sql.for_update
    sql.from(TABLE_NAME)
    Cursor.new('foo', sql, {}, @db) do |cursor|
      until (rows = cursor.fetch(100)).empty?
        for row in rows
          for regex, replacement in regexes
            row['t'] = row['t'].gsub(regex, replacement)
          end
          # the update must run once per row, inside the row loop
          sql = Update.new(TABLE_NAME, @db)
          sql.set('t', row['t'])
          sql.where(['id = %s', row['id']])
          sql.exec
        end
      end
    end
  end

I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regexes so you can see them. Each pair is a regex and the corresponding replacement string:

[[/\bcommit\b/, "COMMIT"],
 [/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
  "9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:53:13\b/, "04:53:13"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bUpdate\b/, "UPDATE"],
 [/\bversion\b/, "VERSION"],
 [/\bto\b/, "TO"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\bcommit\b/, "COMMIT"],
 [/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
  "61C89E56F361FA860F18985137D6BF53F48C16AC"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:51:58\b/, "04:51:58"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bNEWS:\b/, "NEWS:"],
 [/\bAdd\b/, "ADD"],
 [/\bnotes\b/, "NOTES"],
 [/\bfor\b/, "FOR"],
 [/\bthe\b/, "THE"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\brelease.\b/, "RELEASE."],
 [/\bThanks\b/, "THANKS"],
 [/\bto\b/, "TO"],
 [/\beveryone\b/, "EVERYONE"],
 [/\bfor\b/, "FOR"]]
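
The pair-generation step described above can be sketched in plain Ruby; the word list and variable names here are illustrative, not taken from the test program:

```ruby
# Build [regex, replacement] pairs: for each word "foo", the regex is
# /\bfoo\b/ and the replacement is the word uppercased.
words = %w[commit version Thanks]
regexes = words.map { |w| [/\b#{Regexp.escape(w)}\b/, w.upcase] }

# Apply every pair to a sample line, as the test program does per row.
text = "commit version bump, Thanks"
regexes.each { |re, rep| text = text.gsub(re, rep) }
# text is now "COMMIT VERSION bump, THANKS"
```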

If this were a hand-generated list of regexes, and not automatically generated, my question would still stand: which would you rather have to create or maintain?

Wayne Conrad
That's likely to give *horribly* bad performance. PostgreSQL is not an ISAM database.
Magnus Hagander
@Magnus, The performance will be horrible but not worse: Postgres doesn't use indices for regular expressions unless they are left-anchored, so it'll be a row scan no matter what. The best you will get away with is a single row scan instead of 40, and the way to do that is to operate on each row once, applying all 40 regexes to each row. The only difference is whether you make Postgres do the 40 regex replaces on each row-scanned row or have code do it.
Wayne Conrad
It will still be a lot faster to do it one query doing a table scan, than doing it in an ISAM style loop. In most cases, doing 40 regexes is a lot cheaper than the cost of an ISAM loop.
Magnus Hagander
I like the update actually, but I will argue that it isn't a fair representation of code clarity when in one example you omit all but the regex, and in the other example you simply have two keywords and a regex.
Evan Carroll
@Evan, I had a little trouble getting that long statement to format decently. Did you see one of my intermediate attempts, or is it busted as you see it now? In any case, I'll add the actual code for comparison, but it's not much to look at.
Wayne Conrad
Ok, I'm usually the last person to recommend a cursor, but I have to agree with this one. It isn't a row-by-row update (which would take forever given a decent amount of rows); it's a loop through each regex expression you want to update the table for, and then a set-based update statement. So the loop is only for 40 or so cycles, not a million. I think this is much easier to deal with than one giant regex expression or writing 40 separate update statements. To me this is the kind of thing that cursors are for.
HLGEM
"Programs should be written for people to read, and only incidentally for machines to execute."
rjohnston
+1  A: 

For the skip update, look at suppress_redundant_updates_trigger() - see http://www.postgresql.org/docs/8.4/static/functions-trigger.html.

This is not necessarily a win - but it might well be in your case.
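
Per the linked documentation, suppress_redundant_updates_trigger() is installed as a BEFORE UPDATE trigger; the z_ prefix is the docs' trick to make it sort last among the table's triggers (tablename is a placeholder):

```sql
CREATE TRIGGER z_min_update
BEFORE UPDATE ON tablename
FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
```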

Or perhaps you can just add that implicit check as an explicit one?
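
That is, restating the MySQL-style implicit check explicitly, using the question's own example (a sketch):

```sql
UPDATE foo SET bar = 'baz'
WHERE bar IS DISTINCT FROM 'baz';
```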

Magnus Hagander