Hello!
Imagine the following problem:
- You have a database containing about 20,000 texts in a table called "articles"
- You want to connect the related ones using a clustering algorithm in order to display related articles together
- The algorithm should do flat clustering (not hierarchical)
- The related articles should be inserted into the table "related"
- The clustering algorithm should decide whether two or more articles are related or not based on the texts
- I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();
function check($str1, $str2){
$minprozent = 60;
similar_text($str1, $str2, $prozent);
$prozent = sprintf("%01.2f", $prozent);
if ($prozent > $minprozent) {
return TRUE;
}
else {
return FALSE;
}
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
$rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
$rel2 = mysql_query($rel1);
$rel2a = mysql_num_rows($rel2);
if ($rel2a > 0) {
while ($rel3 = mysql_fetch_assoc($rel2)) {
if (check($sql3['text'], $rel3['text']) == TRUE) {
$id_a = $sql3['id'];
$id_b = $rel3['id'];
$rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
$rein2 = mysql_query($rein1);
$rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
$rein4 = mysql_query($rein3);
}
}
}
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
$square = pow($number, 2);
return $square;
}
function check($text1, $text2) {
$words_sub = text_splitter($text2); // splits the text into single words
$words = text_splitter($text1); // splits the text into single words
// document 1 start
$document1 = array();
foreach ($words as $word) {
if (in_array($word, $words)) {
if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
}
}
$rating1 = 0;
foreach ($document1 as $temp) {
$rating1 = $rating1+square($temp);
}
$rating1 = sqrt($rating1);
// document 1 end
// document 2 start
$document2 = array();
foreach ($words_sub as $word_sub) {
if (in_array($word_sub, $words)) {
if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
}
}
$rating2 = 0;
foreach ($document2 as $temp) {
$rating2 = $rating2+square($temp);
}
$rating2 = sqrt($rating2);
// document 2 end
$skalarprodukt = 0;
for ($m=0; $m<count($words)-1; $m++) {
$skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
}
if (($rating1*$rating2) == 0) { continue; }
$kosinusmass = $skalarprodukt/($rating1*$rating2);
if ($kosinusmass < 0.7) {
return FALSE;
}
else {
return TRUE;
}
}
?>
I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!