views:

57

answers:

2

I have a database table that I need to import a bunch of records into. I don't want to overwrite existing records if they're already in the database, so I've set things up to do a select query first to check if there's a value, but apparently the rows are importing too quickly for the index to keep up, as I'm getting duplicates created for every single row that I'm inserting.

I'm importing a CSV file.

Here's what I'm doing (this is inside a Joomla system, so some of the code and objects are joomla-specific):

$fp = fopen(JPATH_ROOT.DS."tmp".DS.$filename, 'r');
//run insert query on each line of file
if(JRequest::getVar('importType')=="activated") {
  while(!feof($fp)) {
       while (($data = fgetcsv($fp, 1000, ",")) !== FALSE) {
             if($this->checkUnique($data[0])) {
                    $this->runInsert2($data[0], $data[1], $data[2], $data[3]);
                    error_log("there is not already a code for ".$data[0]);
             }
             else {
                    error_log("there is already a code for ".$data[0]);
             }
       $row++;
       }
 }
}

fclose($fp);

Here's checkUnique:

function checkUnique($vouchNum) {

        $db =& JFactory::getDBO();

        $query = "select COUNT(*)  from arrc_Voucher where VoucherNbr=".$db->quote($vouchNum);

        if(!$db->query()) error_log("error running unique check on ".$vouchNum." - " . $db->stderr());

        $db->setQuery($query);

        $count = $db->loadResult();

        if($count>0) {

            return false;

        }

        else {

            return true;

        }

    }

And here's runInsert2:

 function runInsert2($vouchNum,$BalanceInit,$BalanceCurrent,$ActivatedDT) {

    $rightNow = date('Y-m-d H:i:s');

    $db =& JFactory::getDBO();



            if($ActivatedDT <> "NULL") {

                $activatedDTtmp = strtotime($ActivatedDT);

                $activatedDT = date('Y-m-d H:i:s',$activatedDTtmp);

            }

            else {

                $activatedDT = $rightNow;

            }


    $query = "insert into arrc_Voucher (VoucherNbr,BalanceInit, BalanceCurrent, ActivatedDT) 
              values (". $db->quote($vouchNum). ", ".$db->quote($BalanceInit).",".$db->quote($BalanceCurrent).",".$db->quote($activatedDT).")";
    error_log("query: ".$query);

    $db->setQuery($query);

    if (!$db->query()) error_log("error inserting voucher number ". $vouchNum . "-" . $db->stderr());

}

I have no clue where I'm going wrong here, but if anyone can help me out (or point me in a better direction for avoiding duplicates) I'd be very grateful. FYI, the field that we're considering to be "unique" (VoucherNbr) is not actually a primary key or in any way marked as unique in the table structure, and cannot be. This is something we need to work around on the coding end right now.

A: 

If you really can't change the tables, you might have to either check for duplicates and remove them after your INSERT, or lock the table before your check for existing rows. You can't guarantee an INSERT doesn't happen between your SELECT and INSERT statements.

Joshua Martell
+2  A: 

Put an unique constraint and use insert ignore, this way you'll never have duplicates. That is if it's ok for the duplicate rows to be ignored.

What is the reason you can't set an unique key on a column than needs to keep unique values?

Another solution would be to import data in a separate table with the same structure.

create table arrc_buffer like arrc_Voucher .

You truncate this table before each import.

Then you can insert into your arrc_Voucher table from this buffer.

1. Remove from the buffer all rows that are already in arrc_Voucher.

delete arrc_buffer b
from arrc_buffer b
inner join arrc_Voucher v on b.VoucherNbr = v.VoucherNbr;

Then insert the rest in arrc_Voucher.

insert into arrc_Voucher
select * from arrc_buffer

Besides these imports, is there any other routine inserting data in arrc_Voucher ?

ceteras
+1 for the buffer table. And you could INSERT in one statement rather than running the DELETE first, with something like "INSERT INTO arrc_voucher SELECT x,y,z FROM arrc_buffer WHERE VoucherNbr NOT IN (SELECT VoucherNbr FROM arrc_voucher)"
richaux
@richaux, select where someting not in () is slow, I try to avoid that. It isn't always better(faster) to put it in one statement. Sometimes I prefer to split the jobs in several statements when each of them can run very fast, and the whole process could go smoother.
ceteras
@ceteras - yes, good point. Maybe a `WHERE NOT EXISTS`? ... or several statements ;)
richaux