Hello!
I have two PHP functions to calculate the relation between two texts. They both use the bag of words model but check2() is much faster. Anyway, both functions give the same results. Why? check1() uses one big dictionary array containing ALL words - as described in the bag of words model. check2() doesn't use one big array but an array containing only the words of one text. So check2() shouldn't work but it doesn. Why do both functions give the same results?
function check1($terms_in_article1, $terms_in_article2) {
global $zeit_check1;
$zeit_s = microtime(TRUE);
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
$zeit_e = microtime(TRUE);
$zeit_check1 += ($zeit_e-$zeit_s);
return $score;
}
function check2($terms_in_article1, $terms_in_article2) {
global $zeit_check2;
$zeit_s = microtime(TRUE);
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$score_table = array();
foreach($terms_in_article1 as $term){
if(!isset($score_table[$term])) $score_table[$term] = 0;
$score_table[$term] += 1;
}
$score_table2 = array();
foreach($terms_in_article2 as $term){
if(isset($score_table[$term])){
if(!isset($score_table2[$term])) $score_table2[$term] = 0;
$score_table2[$term] += 1;
}
}
$score = 0;
foreach($score_table2 as $key => $entry){
$score += $score_table[$key] * $entry;
}
$score = $score/($length1*$length2);
$score *= 500;
$zeit_e = microtime(TRUE);
$zeit_check2 += ($zeit_e-$zeit_s);
return $score;
}
I hope you can help me. Thanks in advance!