views:

91

answers:

2

I am looking for a way to only insert when the row does not exist in MySQL, and update when the row exists AND the version of the existing row is less than (or equal to) the version of the new row.

For example, the table is defined as:

CREATE TABLE documents (
  id VARCHAR(64) NOT NULL,
  version BIGINT UNSIGNED NOT NULL,
  data BLOB,
  PRIMARY KEY (id)
);

And contains the following data:

id  version  data
----------------------------
1   3        first data set
2   2        second data set
3   5        third data set

And I want to merge the following table (UPDATE: id column is unique):

id  version  data
----------------------------
1   4        updated 1st
3   3        updated 2nd
4   1        new 4th

And it should produce the following (UPDATE: see how only 1 is updated and 4 is inserted):

id  version  data
----------------------------
1   4        updated 1st
2   2        second data set
3   5        third data set
4   1        new 4th

I've looked at INSERT ... ON DUPLICATE KEY UPDATE ... statement, but it doesn't allow for some sort of WHERE clause. Also, I can't really use REPLACE because it also does not allow WHERE. Is this even possible with a single MySQL statement?

I am using Java and am trying to possible insert/update many records using the PreparedStatement with batching (addBatch). Any help would be appreciated.

UPDATE: Is there any way to use this query with the PreparedStatement in Java? I have a List of Document objects with id, version, and data.

+2  A: 

I think INSERT ON DUPLICATE KEY UPDATE is your best bet. You can use it like

INSERT INTO table1 SELECT * FROM table2 ON DUPLICATE KEY UPDATE table1.data=IF(table1.version > table2.version, table1.data, table2.data), table1.version=IF(table1.version > table2.version, table1.version, table2.version)

Untested syntax, but I belive the idea should work.

Todd Gardner
I was using 2 tables as an example, but I don't really have 2 tables. I have a list of Document objects which contain the fields: id, version, and data. I'm trying to figure out the syntax for doing something like this via PreparedStatements in Java.
Mohamed Nuur
Then strip out the table2 references and put in ?s where you insert the values.
Todd Gardner
@Todd, I think your answer is the closest to what I am actually looking for. I will set up a test of your answer and Daniel's answer to see which one yields better performance results.
Mohamed Nuur
@Mohamed: You're right, I think this is faster. I ignored the fact that `REPLACE` drops the row before re-inserting it.
Daniel Vassallo
@Daniel, it seems like there's no point testing the performance since I also have to add the time it takes to create a temporary table and populate it and inner join it with the main table.
Mohamed Nuur
If you are concerned with performance, an optimization for bulk INSERTs is to group them; e.g., "INSERT INTO table1 VALUES (?, ?, ?) (?, ?, ?) (?, ?, ?) ..." (I've found that 20+ values at once improve performance drastically). If you use the VALUES statement, you could group the statement I put above: http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_values
Todd Gardner
Thank you very much Todd Gardner, you are indeed the Man! This is EXACTLY what I've been looking for.
Mohamed Nuur
+2  A: 

EDIT: In my earlier answer I suggested that a unique constraint is needed on (id, version), but actually this is not necessary. Your unique constraint on id only is enough for the solution to work.


You should be able to use the REPLACE command as follows:

REPLACE INTO main 
SELECT  IFNULL(m.id, s.id) id, 
        IFNULL(m.version, s.version) version, 
        IFNULL(m.data, s.data) data
FROM       secondary s
LEFT JOIN  main m ON (m.id = s.id AND m.version > s.version);

Test case:

CREATE TABLE main ( 
   id int, 
   version int, 
   data varchar(50), 
   PRIMARY KEY (id)
);

CREATE TABLE secondary (id int, version int, data varchar(50));

INSERT INTO main VALUES (1, 3, 'first data set');
INSERT INTO main VALUES (2, 2, 'second data set');
INSERT INTO main VALUES (3, 5, 'third data set');

INSERT INTO secondary VALUES (1, 4, 'updated 1st');
INSERT INTO secondary VALUES (3, 3, 'udated 2nd');
INSERT INTO secondary VALUES (4, 1, 'new 4th');

Result:

SELECT * FROM main;
+----+---------+-----------------+
| id | version | data            |
+----+---------+-----------------+
|  1 |       4 | updated 1st     |
|  2 |       2 | second data set |
|  3 |       5 | third data set  |
|  4 |       1 | new 4th         |
+----+---------+-----------------+
4 rows in set (0.00 sec)

As a side-note, to help you understand what's happening in that REPLACE command, note the following:

SELECT     s.id s_id, s.version s_version, s.data s_data,
           m.id m_id, m.version m_version, m.data m_data
FROM       secondary s
LEFT JOIN  main m ON (m.id = s.id AND m.version > s.version);

+------+-----------+-------------+------+-----------+----------------+
| s_id | s_version | s_data      | m_id | m_version | m_data         |
+------+-----------+-------------+------+-----------+----------------+
|    1 |         4 | updated 1st | NULL |      NULL | NULL           |
|    3 |         3 | udated 2nd  |    3 |         5 | third data set |
|    4 |         1 | new 4th     | NULL |      NULL | NULL           |
+------+-----------+-------------+------+-----------+----------------+
3 rows in set (0.00 sec)

Then the IFNULL() functions were taking care of "overwriting" the latest version from the main table if one was present, as in the case of id=3, version=5. Therefore the following:

SELECT  IFNULL(m.id, s.id) id, 
        IFNULL(m.version, s.version) version, 
        IFNULL(m.data, s.data) data
FROM       secondary s
LEFT JOIN  main m ON (m.id = s.id AND m.version > s.version);

+------+---------+----------------+
| id   | version | data           |
+------+---------+----------------+
|    1 |       4 | updated 1st    |
|    3 |       5 | third data set |
|    4 |       1 | new 4th        |
+------+---------+----------------+
3 rows in set (0.00 sec)

The above result set contains only records from the secondary table, but if any of these records happen to have a newer version in the main table, then the row is overwritten by the data from the main table. This is the input that we are feeding the REPLACE command.

Daniel Vassallo
Ok Daniel, one more question. How would I do this same query without needing a secondary table?
Mohamed Nuur
@Mohamed: If not, where would the data to merge be held?
Daniel Vassallo
@Mohamed: You may want to [create a temporary](http://dev.mysql.com/doc/refman/5.1/en/create-table.html) secondary table: `CREATE TEMPORARY TABLE secondary (id int, ... );`... Then `INSERT` the data from your application to the temporary secondary table, and apply the `REPLACE` function. Temporary tables are dropped automatically when the session is terminated.
Daniel Vassallo
@Daniel, suppose I was to have a `List<Document>` with `id`, `version`, and `data` members and I wanted to use `PreparedStatement.addBatch()`.. I wanted a simple statement that did (`if not exists then insert ... else if orig.version <= new.version then update ... end`) for a list of 100's of documents
Mohamed Nuur
@Mohamed: Ok understood... What do you think about inserting the data into a temporary table, as suggested in my previous comment?
Daniel Vassallo
I'm not so sure about temporary table performance. I will basically be inserting something in the area of 3MM rows... I don't know if temporary table will be able to handle it. But I'll set up a performance test to see if it will be acceptable. Thanks!
Mohamed Nuur
@Daniel, also I found that the REPLACE does a DELETE operation. This might be bad for me since I will be inserting/updating so many rows..
Mohamed Nuur