Background:

I am parsing a 330 MB XML file (the Netflix catalog) into a DB, using a PHP script run from the console.

I can successfully add about 1,500 titles every 3 seconds, until I add the logic for actors, genres, and formats. These are separate tables linked by an associative table.

Right now I have to run many, many queries for each title, in this order (I truncate all tables first, to eliminate old titles, genres, etc.):

  1. add the new title to 'titles' and capture the insert id
  2. check the actor table for an existing actor
  3. if present, get the id; if not, insert the actor and get the insert id
  4. insert the title id and actor id into the associative table

(steps 2-4 are repeated for genres too)

This drops my speed down to about 10 per 3 seconds, which would take an eternity to add the ~250,000 titles.

So how would I combine the 4 queries into a single query, without adding duplicate actors or genres?

My goal is to just write all queries into a data file, and do a bulk insert.

I started by writing all associative queries into a data file, but it didn't do much for performance.
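For the bulk-load step itself, MySQL's LOAD DATA INFILE is generally much faster than replaying a file of individual INSERT statements, because the server parses one statement and streams the rows. A minimal sketch, assuming the table behind `$this->title_persons_table` is named `title_persons` and the script writes tab-delimited id pairs instead of full queries (file path and names are illustrative):

```sql
-- Write "title_id <TAB> person_id" lines to /tmp/title_persons.tsv,
-- then load the whole file in one statement:
LOAD DATA INFILE '/tmp/title_persons.tsv'
INTO TABLE title_persons
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(title_id, person_id);
```

The MySQL user needs the FILE privilege, and the file must be readable by the server (or use LOAD DATA LOCAL INFILE from the client).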


I start by inserting the title and saving the ID:

function insertTitle($nfid, $title, $year){
    $query="INSERT INTO ".$this->titles_table." (nf_id, title, year ) VALUES ('$nfid','$title','$year')";
    mysql_query($query);
    $this->updatedTitleCount++;
    return mysql_insert_id();
}

That is then used in conjunction with each actor's name to create the association:

function linkActor($value, $title_id){
    // check if we already know this actor
    $query = "SELECT id FROM ".$this->persons_table." WHERE person = '$value' LIMIT 1";
    $result = mysql_query($query);
    if ($result && mysql_num_rows($result) != 0) {
        $row = mysql_fetch_assoc($result);
        $value_id = $row['id'];
    } else {
        // actor not known yet, add to persons table
        $query = "INSERT INTO ".$this->persons_table." (person) VALUES ('$value')";
        mysql_query($query);
        $value_id = mysql_insert_id();
    }
    // link title_id with $value_id: write the query to a data file
    // to be run later in bulk style, instead of executing it here
    $query = "INSERT INTO ".$this->title_persons_table." (title_id, person_id) VALUES ('$title_id','$value_id');";
    fwrite($this->fh, $query);
}
A: 

Can you use a language other than PHP? If not, are you running this as a PHP stand-alone script or through a webserver? The webserver is probably adding a lot of overhead you don't need.

I do something very similar at work, using Python, and can insert a couple thousand rows (with associative table lookups) per second on a standard 3.4 GHz, 3 GB RAM machine. The MySQL database isn't hosted locally, but within the LAN.

Michael Wales
Thanks Michael, I added the details above. Unfortunately I am not very proficient with Python. And I am running the script from the command line to avoid Apache's overhead.
Eddie
A: 

I cannot give you the exact query for your task, but have a look at INSERT ... SELECT
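To illustrate the idea: INSERT ... SELECT lets the lookup and the associative insert happen in one statement. A sketch against the question's schema (table and column names assumed from the posted code, title id hard-coded for the example):

```sql
-- Link title 123 to an actor by name in a single statement;
-- note it inserts nothing if the actor isn't in persons yet:
INSERT INTO title_persons (title_id, person_id)
SELECT 123, p.id
FROM persons AS p
WHERE p.person = 'Some Actor';
```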

Gordon
I think that would be ideal if I knew all actors existed and just needed to insert into the associative table. But if the actor does not exist, the INSERT ... SELECT would not remedy that.
Eddie
+1  A: 

Your performance is glacially slow; something is very wrong. I assume the following:

  • You run your dedicated, otherwise-idle database server on respectable hardware
  • You have tuned it to some extent (i.e. at least configured it to use a few gigs of RAM properly); engine-specific optimisations will be required

You may be being stung by doing lots of tiny operations with autocommit on; this is a mistake, as it generates an unreasonable number of disc I/O operations. You should do a large amount of work (100, 1,000 records, etc.) in a single transaction, then commit it.
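The batching above can be sketched directly in SQL; with InnoDB this turns hundreds of per-row disc flushes into one flush per batch (batch size is a tuning knob, not a fixed rule):

```sql
START TRANSACTION;
-- ... run the batch of inserts: titles, persons, title_persons ...
INSERT INTO titles (nf_id, title, year) VALUES ('70000001', 'Example Title', '1999');
-- ... a few hundred to a few thousand more statements ...
COMMIT;  -- one durable write for the whole batch
```

Note this only helps with a transactional engine (InnoDB); MyISAM tables ignore transactions.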

The lookups may be slowing things down because of the simple overhead of doing the queries (the queries themselves will be really easy as you'll have an index on actor name).

I also question your method of assuming that no two actors have the same name - surely your original database contains a unique actor ID, so you don't get them mixed up?

MarkR
Thanks Mark. I am running a dedicated virtualized server for the DB, and the hardware is decent, but my memory usage is capped around 1 gig. Even so, looking at my performance stats, memory use is flat, so I think I need to optimize the server to be a little more aggressive. I guess I have some reading to do.
Eddie
+1  A: 

This is a perfect opportunity for using prepared statements.
Also take a look at the tips at http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html, e.g.

To speed up INSERT operations that are performed with multiple statements for nontransactional tables, lock your tables

You can also decrease the number of queries. E.g. you can eliminate the SELECT...FROM persons_table to obtain the id by using INSERT...ON DUPLICATE KEY UPDATE and LAST_INSERT_ID(expr).
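In raw SQL, that collapse of the SELECT-then-INSERT pair looks like the following sketch (it assumes a UNIQUE key on the person column, as in the example code further down):

```sql
-- If the row is new, a normal insert happens. If it already exists,
-- the UPDATE clause fires and LAST_INSERT_ID(id) makes the existing
-- id available as if it had just been inserted:
INSERT INTO persons (person) VALUES ('Some Actor')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

SELECT LAST_INSERT_ID();  -- the actor's id, whether new or pre-existing
```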

( sorry, running out of time for a lengthy description, but I wrote an example before noticing the time ;-) If this answer isn't downvoted too much I can hand it in later. )

class Foo {
  protected $persons_table='personsTemp';
  protected $pdo;
  protected $stmts = array();

  public function __construct($pdo) {
    $this->pdo = $pdo;
    $this->stmts['InsertPersons'] = $pdo->prepare('
      INSERT INTO
        '.$this->persons_table.'
        (person)
      VALUES
        (:person)
      ON DUPLICATE KEY UPDATE
        id=LAST_INSERT_ID(id)
    ');
  }

  public function getActorId($name) {
    $this->stmts['InsertPersons']->execute(array(':person'=>$name));
    return $this->pdo->lastInsertId('id');
  }
}

$pdo = new PDO("mysql:host=localhost;dbname=test", 'localonly', 'localonly'); 
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// create a temporary/test table
$pdo->exec('CREATE TEMPORARY TABLE personsTemp (id int auto_increment, person varchar(32), primary key(id), unique key idxPerson(person))');
// and fill in some data
foreach(range('A', 'D') as $p) {
  $pdo->exec("INSERT INTO personsTemp (person) VALUES ('Person $p')");
}

$foo = new Foo($pdo);
foreach( array('Person A', 'Person C', 'Person Z', 'Person B', 'Person Y', 'Person A', 'Person Z', 'Person A') as $name) {
  echo $name, ' -> ', $foo->getActorId($name), "\n";
}

prints

Person A -> 1
Person C -> 3
Person Z -> 5
Person B -> 2
Person Y -> 6
Person A -> 1
Person Z -> 5
Person A -> 1

(someone might want to start a discussion whether a getXYZ() function should perform an INSERT or not ...but not me, not now....)

VolkerK
This looks promising for the reference tables and should save me thousands of queries by the end of the job. Thanks!
Eddie
Nice, I have boosted my count to 100 titles per second by leveraging this for genres, formats, and actors. Do you know what the performance of the INSERT ... ON DUPLICATE KEY UPDATE is vs. a SELECT?
Eddie