I have a CSV file that has 3.5 million codes in it.
I should point out that this is only EVER going to be run once.

The CSV looks like this:

age9tlg,  
rigfh34,  
...

Here is my code:

ini_set('max_execution_time', 600);
ini_set("memory_limit", "512M");
$file_handle = fopen("Weekly.csv", "r");
while (!feof($file_handle)) {
    $line_of_text = fgetcsv($file_handle);

    if (is_array($line_of_text)) {
        foreach ($line_of_text as $col) {
            if (!empty($col)) {
                mysql_query("insert into `action_6_weekly` Values('$col', '')") or die(mysql_error());
            }
        }
    } else {
        if (!empty($line_of_text)) {
            mysql_query("insert into `action_6_weekly` Values('$line_of_text', '')") or die(mysql_error());
        }
    }
}
fclose($file_handle);

Is this code going to die part way through on me? Will my memory and max execution time be high enough?

NB: This code will be run on my localhost, and the database is on the same PC, so latency is not an issue.


Update:
Here is another possible implementation. This one does bulk inserts of 2000 records at a time:

$file_handle = fopen("Weekly.csv", "r");
$vals = array();
while (!feof($file_handle)) {
    $line_of_text = fgetcsv($file_handle);

    if (is_array($line_of_text)) {
        foreach ($line_of_text as $col) {
            if (!empty($col)) {
                $vals[] = "('$col', '')";
            }
        }
    } else {
        if (!empty($line_of_text)) {
            $vals[] = "('$line_of_text', '')";
        }
    }

    // flush a batch once 2000 values have accumulated
    if (count($vals) >= 2000) {
        mysql_query("insert into `action_6_weekly` Values " . implode(', ', $vals)) or die(mysql_error());
        $vals = array();
    }
}

// insert whatever is left over after the last full batch
if (!empty($vals)) {
    mysql_query("insert into `action_6_weekly` Values " . implode(', ', $vals)) or die(mysql_error());
}
fclose($file_handle);

If I was to use this method, what is the highest value I could set it to insert at once?


Update 2
So, I've found I can use:

LOAD DATA LOCAL INFILE 'C:\\xampp\\htdocs\\weekly.csv'
INTO TABLE `action_6_weekly`
FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY ','
(`code`)

but the issue now is that I was wrong about the CSV format; it is actually 4 codes and then a line break, so:

fhroflg,qporlfg,vcalpfx,rplfigc,
vapworf,flofigx,apqoeei,clxosrc,
...

So I need to be able to specify two LINES TERMINATED BY values.
This question has been branched out to Here.


Update 3
Setting it to do bulk inserts of 20k rows, using:

while (!feof($file_handle)) {
   $val[] = fgetcsv($file_handle);
   $i++;
   if($i == 20000) {
      //do insert
      //set $i = 0;
      //$val = array();
   }
}

//do insert (for the last few rows that don't reach 20k)

But it dies at this point because, for some reason, $val contains 75k rows. Any idea why?
Note: the above code is simplified.

+2  A: 

Is this code going to die part way through on me? Will my memory and max execution time be high enough?

Why don't you try and find out?

You can adjust both the memory (memory_limit) and execution time (max_execution_time) limits, so if you really have to use that, it shouldn't be a problem.
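
For a one-off script like this, the limits can even be lifted entirely; a small sketch (the values -1 and 0 mean "no limit"):

ini_set('memory_limit', '-1');   // -1 removes the memory limit
set_time_limit(0);               // 0 removes the execution time limit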

Note that MySQL supports delayed and multiple row insertion:

INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);

http://dev.mysql.com/doc/refman/5.1/en/insert.html

NullUserException
+19  A: 

I doubt this will be the popular answer, but I would have your PHP application run mysqlimport on the CSV file. Surely it is optimized far beyond what you will do in PHP.
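
For instance, a rough sketch of shelling out to mysqlimport from PHP (the database name, credentials and path below are placeholders, mysqlimport derives the table name from the file name, and the terminator options would have to match the real file layout):

// hypothetical path; the file must be named after the target table
$csv = 'C:\\xampp\\htdocs\\action_6_weekly.csv';
$cmd = 'mysqlimport --local --fields-terminated-by="," '
     . '--user=root --password=secret mydb ' . escapeshellarg($csv);
exec($cmd . ' 2>&1', $output, $status);
if ($status !== 0) {
    die("mysqlimport failed:\n" . implode("\n", $output));
}
echo "Import finished.";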

Ah the classic "I'm going to be downvoted for this". The tactic works with me: +1.
Artefacto
And is it possible to set two LINES TERMINATED BY values, considering this maps directly onto LOAD DATA INFILE?
Hailwood
A: 

You should accumulate the values and insert them into the database all at once at the end, or in batches every x records. Doing a single query for each row means 3.5 million SQL queries, each carrying quite some overhead.

Also, you should run this on the command line, where you won't need to worry about execution time limits.

The real answer, though, is evilclown's: importing CSV data into MySQL is already a solved problem.

deceze
The CSV is not in the right format; it is missing some columns.
Hailwood
@Hailwood That shouldn't be a problem, see the examples in the manual: `LOAD DATA INFILE 'persondata.txt' INTO TABLE persondata (col1,col2,...);`
deceze
Would you mind telling me what I need to write? The CSV is in the format of `code,` (newline) `code,` (newline) `code,` (newline) `code,` (newline), but I need to insert in the format of ('', code, 0).
Hailwood
Wouldn't it be quicker to manipulate the CSV into something useful and import that?
Keyo
@Hailwood Apply a default value of `''` and `0` to the first and third column respectively (in the table definition) and only insert the second column with `... INTO TABLE foo (second_column_name)`.
deceze
Please see Update 2 above.
Hailwood
+2  A: 

I hope there is not a web client waiting for a response on this. Other than calling the import utility already referenced, I would start this as a job and return feedback to the client almost immediately. Have the insert loop update a percentage-complete somewhere so the end user can check the status, if you absolutely must do it this way.
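
For example, the insert loop could periodically write its progress to a small status file that a separate page polls; a minimal sketch (the file name and reporting interval are purely illustrative, and the actual insert is left out):

$total       = 3500000;   // approximate number of codes, from the question
$processed   = 0;
$lastReport  = 0;
$file_handle = fopen("Weekly.csv", "r");
while (($line = fgetcsv($file_handle)) !== false) {
    // ... do the actual insert for $line here, as in the other answers ...
    $processed += count($line);
    if ($processed - $lastReport >= 100000) {   // report roughly every 100k codes
        file_put_contents('import_progress.txt',
            round($processed / $total * 100) . '% complete');
        $lastReport = $processed;
    }
}
fclose($file_handle);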

A: 

Two possible ways:

1) Batch the process, then have a scheduled job import the file while updating a status. This way, you can have a page that keeps checking the status and refreshes itself if the status is not yet 100%. Users will have a live update of how much has been done. But for this you need access to the OS to be able to set up the scheduled task, and the task will sit idle when there is nothing to import.

2) Have the page handle 1000 rows (or any N number of rows... you decide), then send some JavaScript to the browser to refresh itself with a new parameter telling the script to handle the next 1000 rows. You can also display a status to the user while this is happening. The only problem is that if the page somehow does not refresh, the import stops.
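
A rough sketch of option 2, assuming a hypothetical import.php that is called repeatedly with an ?offset= parameter and an already-open mysql_* connection as in the question (the batch size of 1000 is illustrative):

$batchSize = 1000;
$offset    = isset($_GET['offset']) ? (int)$_GET['offset'] : 0;

$file_handle = fopen("Weekly.csv", "r");

// skip the lines already handled by previous requests
for ($i = 0; $i < $offset && fgets($file_handle) !== false; $i++);

// collect this request's batch of values
$vals = array();
for ($i = 0; $i < $batchSize && ($line = fgetcsv($file_handle)) !== false; $i++) {
    foreach ($line as $col) {
        if (!empty($col)) {
            $vals[] = "('" . mysql_real_escape_string($col) . "', '')";
        }
    }
}
if (!empty($vals)) {
    mysql_query("insert into `action_6_weekly` Values " . implode(', ', $vals)) or die(mysql_error());
}

$done = feof($file_handle);
fclose($file_handle);

echo "Handled lines $offset to " . ($offset + $batchSize) . "<br>";
if (!$done) {
    // send the browser back for the next chunk
    echo "<script>location.href = 'import.php?offset=" . ($offset + $batchSize) . "';</script>";
} else {
    echo "Import finished.";
}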

iWantSimpleLife
+1  A: 
  1. Make sure there are no indexes on your table, as indexes will slow down inserts (add the indexes back after you've done all the inserts).
  2. Rather than creating a new SQL statement on each iteration of the loop, prepare the SQL statement outside the loop and execute that prepared statement with parameters inside the loop (sketched below). Depending on the database this can be heaps faster.

I've done the above when importing a large Access database into Postgres using Perl and got the insert time down to 30 seconds. I would have used an importer tool, but I wanted Perl to enforce some rules when inserting.
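
A minimal sketch of point 2 using PDO (an assumption on my part; the mysql_* API used in the question has no prepared statements, and the DSN and credentials below are placeholders):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'root', '');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// prepare once, outside the loop...
$stmt = $pdo->prepare("INSERT INTO `action_6_weekly` VALUES (?, '')");

$file_handle = fopen("Weekly.csv", "r");
$pdo->beginTransaction();   // wrapping everything in one transaction also helps on InnoDB
while (($line = fgetcsv($file_handle)) !== false) {
    foreach ($line as $col) {
        if (!empty($col)) {
            $stmt->execute(array($col));   // ...and execute with parameters inside it
        }
    }
}
$pdo->commit();
fclose($file_handle);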

Matthew Lock
You can also disable indexes.
Cfreak