In my news page project, I have a database table news with the following structure:

 - id: [integer] unique number identifying the news entry, e.g.: *1983*
 - title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
 - topic: [string] category which should be chosen by the classifier, e.g.: *Sports*

Additionally, there's a table bayes with information about word frequencies:

 - word: [string] a word which the frequencies are given for, e.g.: *real estate*
 - topic: [string] same content as "topic" field above, e.g.: *Economics*
 - count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*

Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.

Is this the correct implementation? Can you improve it?

<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>

The training is done manually and isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all of its words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and an initial count of 1. If a word already exists in combination with the same topic, its count is incremented instead.
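The training step described above could be sketched roughly as follows. This is only an illustration: the array `$counts` stands in for the bayes table, and the function name `train()` and the simple regex-based word splitting are assumptions, since the real training code and `tokenizer()` are not shown.

```php
<?php
// Hypothetical in-memory stand-in for the bayes table:
// $counts[topic][word] = count.
function train(array $counts, string $text, string $topic): array
{
    // Assumption: words are lowercase runs of letters and digits.
    // The real tokenizer() from the question may behave differently.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    foreach ($words as $word) {
        // Start a new (word, topic) pair at 1, or increment an existing one.
        $counts[$topic][$word] = ($counts[$topic][$word] ?? 0) + 1;
    }
    return $counts;
}

$counts = train([], 'You can make money if you sell real estates', 'Economics');

// "you" occurs twice in the sentence, so its count is incremented once.
echo $counts['Economics']['you'], "\n";
```

In the real setup each assignment would presumably become an INSERT or UPDATE on the bayes table rather than an array write.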

Sample learning data:

word       topic       count
kaczynski  Politics    1
sony       Technology  1
bank       Economics   1
phone      Technology  1
sony       Economics   3
ericsson   Technology  2

Sample output/result:

Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry

Politics

....phone ....test ....sony ....ericsson ....aspen ....sensitive ....winberry

Technology

....phone FOUND ....test ....sony FOUND ....ericsson FOUND ....aspen ....sensitive ....winberry

Economics

....phone ....test ....sony FOUND ....ericsson ....aspen ....sensitive ....winberry

Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
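The sample result above can be reproduced without a database. The following is a minimal sketch that hard-codes the sample learning data and the tokens from the debug output; the variable names are my own, and the tokens are assumed to be exactly what `tokenizer()` returns for this title.

```php
<?php
// Sample learning data from the question: $counts[topic][word] = count.
$counts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
];
$tokens = ['phone', 'test', 'sony', 'ericsson', 'aspen', 'sensitive', 'winberry'];

// Total word count per topic (keys are preserved) and overall.
$topicTotals = array_map('array_sum', $counts);
$grandTotal = array_sum($topicTotals);

$scores = [];
foreach ($counts as $topic => $wordCounts) {
    $score = $topicTotals[$topic] / $grandTotal; // p(topic)
    foreach ($tokens as $token) {
        // Words unknown to the topic are skipped, as in the question's code.
        if (isset($wordCounts[$token])) {
            $score *= $wordCounts[$token] / $topicTotals[$topic]; // p(word|topic)
        }
    }
    $scores[$topic] = $score;
}

foreach ($scores as $topic => $score) {
    printf("%-10s %.6f\n", $topic, $score);
}
```

With this data Technology gets 4/9 * 1/4 * 1/4 * 2/4 = 1/72, which is about 0.013889, while Politics gets 1/9 and Economics 1/3. Every matched word multiplies the score by a factor below 1 and unmatched words are skipped, so the topic with the most matches ends up with the lowest value here.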

Thank you very much in advance!

+1  A: 

It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, although you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here. If you're only classifying a single document, computing on the fly is fine, since you then skip words that aren't in the document.)

Similarly, the calculation of p(topic) could be moved outside of the loop.

Finally, you don't need to sort the entire array to find the maximum.
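As a small sketch of that last point, the best topic can be found in a single pass instead of sorting. The scores below are made up purely for illustration.

```php
<?php
// Hypothetical per-topic scores; the numbers are invented for this example.
$scores = ['Politics' => 0.02, 'Technology' => 0.31, 'Economics' => 0.07];

// Single pass: remember the best topic seen so far.
$bestTopic = null;
$bestScore = -INF;
foreach ($scores as $topic => $score) {
    if ($score > $bestScore) {
        $bestTopic = $topic;
        $bestScore = $score;
    }
}

// Alternative with built-ins: max() finds the value, array_search() its key.
$sameTopic = array_search(max($scores), $scores, true);

echo $bestTopic, "\n";
```

If you want the lowest value instead, as your current code does, the same loop works with the comparison reversed and `INF` as the starting value.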

All small points! But that's what you asked for :)

I've written some untested PHP code showing how I'd implement this below:

<?php

// Get word counts from database as $nWordPerTopic[topic][word] = count
// (mystery_sql() is a placeholder for your own query code)
$nWordPerTopic = mystery_sql();

// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);

    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;

    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}

// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;

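// Classify ($documents is assumed to hold the rows fetched from the news table)
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        // Start with p(topic), then multiply by p(word|topic) for each known word
        $p = $pTopic;
        foreach($tokens as $word)
        {
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }

        // Keep track of the most probable topic so far
        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
    echo '<p>The text belongs to topic '.$selectedTopic.' with a likelihood of '.$pMax.'</p>';
}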
?>

As for the maths...

You're trying to maximize p(topic|words), so find

arg max p(topic|words)

(i.e. the value of topic for which p(topic|words) is the highest)

Bayes' theorem says

                  p(topic)*p(words|topic)
p(topic|words) = -------------------------
                        p(words)

So you're looking for

         p(topic)*p(words|topic)
arg max -------------------------
               p(words)

Since p(words) of a document is the same for every topic, this is the same as finding

arg max p(topic)*p(words|topic)

The naive Bayes assumption (which makes this a naive Bayes classifier) is that

p(words|topic) = p(word1|topic) * p(word2|topic) * ...

So using this, you need to find

arg max p(topic) * p(word1|topic) * p(word2|topic) * ...

Where

p(topic) = number of words in topic / number of words in total

And

                   p(word, topic)                         1
p(word | topic) = ---------------- = p(word, topic) * ----------
                      p(topic)                         p(topic)

      number of times word occurs in topic     number of words in total
   = -------------------------------------- * --------------------------
            number of words in total           number of words in topic

      number of times word occurs in topic 
   = --------------------------------------
            number of words in topic
Michael Clerx
As an afterthought... Did you think of removing stop words (the, and, it, etc.) from the word count list?
Michael Clerx
Thank you very much for this answer. Removing stop words is a nice way to improve the results.
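
The stop-word removal suggested in the comments could look roughly like this. It is only a sketch: the stop-word list is a tiny hypothetical one, and the tokens are a made-up example of what a tokenizer might return.

```php
<?php
// Hypothetical, deliberately tiny stop-word list; a real one would be longer.
$stopWords = ['the', 'and', 'it', 'a', 'of', 'in', 'to'];

// Made-up tokens, as they might come out of a tokenizer.
$tokens = ['the', 'bank', 'and', 'the', 'phone', 'of', 'sony'];

// array_diff() keeps the tokens that are not in the stop-word list;
// array_values() reindexes the result from 0.
$filtered = array_values(array_diff($tokens, $stopWords));

print_r($filtered);
```

The same filter would have to be applied both when training and when classifying, so that the counts in the bayes table and the tokens of a new title stay comparable.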