I have been set a challenge to create an indexer that takes all words 4 characters or more, and stores them in a database along with how many times the word was used.

I have to run this indexer on 4,000 txt files. Currently, it takes about 12-15 minutes - and I'm wondering if anyone has a suggestion for speeding things up?

Currently I'm placing the words in an array as follows:

// ==============================================================
// === Create an index of all the words in the document
// ==============================================================
function index(){

 $this->index = Array();
 $this->index_frequency = Array();

 $this->original_file = str_replace("\r", " ", $this->original_file);
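 // Split on any run of whitespace (spaces, tabs, newlines) so that
 // words on adjacent lines are not glued together by clean_string()
 $this->index = preg_split('/\s+/', $this->original_file, -1, PREG_SPLIT_NO_EMPTY);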

 // Build new frequency array
 foreach($this->index as $key=>$value){

  // remove everything except letters
  $value = clean_string($value);

  if($value == '' || strlen($value) < MIN_CHARS){
   continue;
  }

  // Counts are never null, so isset() is an equivalent existence check here
  if(isset($this->index_frequency[$value])){
   $this->index_frequency[$value]++;
  } else{
   $this->index_frequency[$value] = 1;
  }

 }

 return $this->index_frequency;

}

I think the biggest bottleneck at the moment is the script that stores the words in the database. It needs to add the document to the essays table and then, if the word already exists in the index table, append essayid(frequency of the word) to its essays field; if the word doesn't exist, it adds a new row for it...

// ==============================================================
// === Store the word frequencies in the db
// ==============================================================
private function store(){

 $index = $this->index();

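 // Escape the filename: it is arbitrary text and may contain quotes
 $title = mysql_real_escape_string($this->original_filename);

 mysql_query("INSERT INTO essays (checksum, title, total_words) VALUES ('{$this->checksum}', '$title', '{$this->get_total_words()}')") or die(mysql_error());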

 $essay_id = mysql_insert_id();

 foreach($this->index_frequency as $key=>$value){

  $check_word = mysql_result(mysql_query("SELECT COUNT(word) FROM `index` WHERE word = '$key' LIMIT 1"), 0);

  $eid_frequency = $essay_id . "(" . $value . ")";

  if($check_word == 0){
   $save = mysql_query("INSERT INTO `index` (word, essays) VALUES ('$key', '$eid_frequency')");
  } else {
   $eid_frequency = "," . $eid_frequency;
   $save = mysql_query("UPDATE `index` SET essays = CONCAT(essays, '$eid_frequency') WHERE word = '$key' LIMIT 1");
  }

 }

}

Any ideas??

+1  A: 

You might consider profiling your app to find out exactly where your bottlenecks are. That will give you a better understanding of what can be improved.
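
If a full profiler is not at hand, a rough first pass is to time each stage with microtime(). The sketch below is only an illustration: time_stage() is a hypothetical helper, not part of the asker's class, and the two closures are stand-ins for the real index() and store() calls.

```php
<?php
// Hypothetical helper: runs $work and adds its wall-clock time to $timings[$label].
function time_stage(array &$timings, string $label, callable $work)
{
    $start = microtime(true);
    $result = $work();
    $timings[$label] = ($timings[$label] ?? 0.0) + (microtime(true) - $start);
    return $result;
}

$timings = [];

// Stand-in for $essay->index(): count word frequencies.
$words = time_stage($timings, 'index', function () {
    return array_count_values(['alpha', 'beta', 'alpha']);
});

// Stand-in for $essay->store(): the sleep simulates database round trips.
time_stage($timings, 'store', function () use ($words) {
    usleep(1000);
});

// Slowest stage first.
arsort($timings);
foreach ($timings as $label => $seconds) {
    printf("%-6s %.4f s\n", $label, $seconds);
}
```

Summing the timings across all 4,000 files should show whether the time goes into parsing or into the database writes.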

Regarding DB optimisation: check that you have an index on the word column, then try to reduce the number of times you access the DB. INSERT ... ON DUPLICATE KEY UPDATE ..., maybe?
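
As a minimal sketch of that idea, the per-word SELECT plus INSERT/UPDATE could be collapsed into one multi-row statement per essay. This assumes a UNIQUE index on `index`.word (for example via ALTER TABLE `index` ADD UNIQUE (word)) and that clean_string() leaves only letters in each word. build_index_upsert() is a hypothetical name; it only builds the SQL string, and running it is left to whatever MySQL client you use.

```php
<?php
// Hypothetical helper: builds one multi-row upsert for all words of an essay.
// Assumes a UNIQUE index on `index`.word so ON DUPLICATE KEY UPDATE can fire.
function build_index_upsert(int $essay_id, array $frequencies): string
{
    $rows = [];
    foreach ($frequencies as $word => $count) {
        $word = (string) $word;
        // Only letters are allowed, so the word needs no further escaping.
        if (!ctype_alpha($word)) {
            continue;
        }
        $rows[] = sprintf("('%s', '%d(%d)')", $word, $essay_id, $count);
    }
    if ($rows === []) {
        return '';
    }
    // New words are inserted; existing words get ",essayid(frequency)" appended.
    return "INSERT INTO `index` (word, essays) VALUES " . implode(', ', $rows)
         . " ON DUPLICATE KEY UPDATE essays = CONCAT(essays, ',', VALUES(essays))";
}

echo build_index_upsert(12, ['alpha' => 3, 'beta' => 1]), "\n";
```

For an essay with a very large vocabulary, the rows may need to be sent in chunks so that a single statement stays under the server's max_allowed_packet limit.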

n1313
Thanks n1313! I worked on reducing the number of times I queried the db. Thanks for your help.
Matt