tags:

views:

253

answers:

2

Hello!

I have two PHP functions to calculate the relation between two texts. They both use the bag of words model but check2() is much faster. Anyway, both functions give the same results. Why? check1() uses one big dictionary array containing ALL words - as described in the bag of words model. check2() doesn't use one big array but an array containing only the words of one text. So check2() shouldn't work but it doesn. Why do both functions give the same results?

function check1($terms_in_article1, $terms_in_article2) {
    global $zeit_check1;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    $zeit_e = microtime(TRUE);
    $zeit_check1 += ($zeit_e-$zeit_s);
    return $score;
}
function check2($terms_in_article1, $terms_in_article2) {
    global $zeit_check2;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score/($length1*$length2);
    $score *= 500;
    $zeit_e = microtime(TRUE);
    $zeit_check2 += ($zeit_e-$zeit_s);
    return $score;
}

I hope you can help me. Thanks in advance!

+2  A: 

Both functions implement pretty much the same algorithm, but while the first one does it in straightforward way, the second one is a bit more clever and skips a portion of unneccessary work.

check1 goes like this:

// loop length(words1) times
for each word in words1:
    freq1[word]++

// loop length(words2) times
for each word in words2:
    freq2[word]++

// loop length(union(words1, words2)) times
for each word in union(words1, words2):
    score += freq1[word] * freq2[word]

But remember: when you multiply something with zero, you will get zero.

This means, that counting the frequencies of words that aren't in both sets is a waste of time - we multiply the frequency by zero and that will add nothing to the score.

check2 takes this into account:

// loop length(words1) times
for each word in words1:
    freq1[word]++

// loop length(words2) times
for each word in words2:
    if freq1[word] > 0:
        freq2[word]++

// loop length(intersection(words1, words2)) times
for each word in freq2:
    score += freq1[word] * freq2[word]
Rene Saarsoo
+3  A: 

Hello, since you seem to concerned about performance, here's an optimized version of the algorithm in your check2 function that uses some more built-in functions for improved speed.

function check ($terms1, $terms2)
{
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
     if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}
Werner
Thank you very very much, Werner. You're right, your version is even faster than the fastest of my two versions. Unfortunately, I asked for the cause why the other functions work, too. So I must choose Rene Saarsoo's answer as the best one since he answered my question perfectly. But you helped me alot, too. Thanks! :)
+1 for showing beautifully fast code!
0scar