views:

1020

answers:

2

Hi... When using LOAD DATA INFILE, is there a way to either flag a duplicate row, or dump any/all duplicates into a separate table?

+5  A: 

From the LOAD DATE INFILE documentation:

The REPLACE and IGNORE keywords control handling of input rows that duplicate existing rows on unique key values:

  • If you specify REPLACE, input rows replace existing rows. In other words, rows that have the same value for a primary key or unique index as an existing row. See Section 12.2.7, “REPLACE Syntax”.
  • If you specify IGNORE, input rows that duplicate an existing row on a unique key value are skipped. If you do not specify either option, the behavior depends on whether the LOCAL keyword is specified. Without LOCAL, an error occurs when a duplicate key value is found, and the rest of the text file is ignored. With LOCAL, the default behavior is the same as if IGNORE is specified; this is because the server has no way to stop transmission of the file in the middle of the operation.

Effectively, there's no way to redirect the duplicate records to a different table. You'd have to load them all in, and then create another table to hold the non-duplicated records.

OMG Ponies
I wonder if it would be possible to create a BEFORE INSERT trigger that queries the table to see if the ID of the insert exists, and then inserts it into another table if it does. It'd be slow as heck, but it could work ... (BEFORE / AFTER INSERT triggers gets called for every row in a LOAD DATA query).
cmptrgeekken
A: 

It looks as if there actually is something you can do when it comes to duplicate rows for LOAD DATA calls. However, the approach that I've found isn't perfect: it acts more as a log for all deletes on a table, instead of just for LOAD DATA calls. Here's my approach:

Table test:

CREATE TABLE test (
    id INTEGER PRIMARY KEY,
    text VARCHAR(255) DEFAULT NULL
);

Table test_log:

CREATE TABLE test_log (
    id INTEGER, -- not primary key, we want to accept duplicate rows
    text VARCHAR(255) DEFAULT NULL,
    time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Trigger del_chk:

delimiter //
drop trigger if exists del_chk;
CREATE TRIGGER del_chk AFTER DELETE ON test
FOR EACH ROW
BEGIN
    INSERT INTO test_log(id,text) values(OLD.id,OLD.text);
END;//
delimiter ;

Test import (/home/user/test.csv):

1,asdf
2,jkl
3,qwer
1,tyui
1,zxcv
2,bnm

Query:

LOAD DATA INFILE '/home/ken/test.csv'
REPLACE INTO TABLE test 
FIELDS 
    TERMINATED BY ','
LINES
    TERMINATED BY '\n' (id,text);

Running the above query will result in 1,asdf, 1,tyui, and 2,jkl being added to the log table. Based on a timestamp, it could be possible to associate the rows with a particular LOAD DATA statement.

cmptrgeekken