Hello!

I have the following PHP function to calculate the relation between two texts:

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}

Each variable $terms_in_articleX must be an array containing all the single words that appear in the text.

With a database of 20,000 texts, this function would take a very long time to work through all the pairwise relationships.

How can I accelerate this process? Should I put all the texts into a huge matrix instead of always comparing just two at a time? It would be great if you had some approaches with code, preferably in PHP.

I hope you can help me. Thanks in advance!

+1  A: 

EDIT: Trying to be more explicit:

  1. First, encode every term into an integer. You can use a dictionary associative array, like this:

        $dict = array();        // term (string) => integer id
        $doc_as_int = array();  // integer id => count in this document
        $count = 0;
        foreach ($doc as $term) {
          if (!isset($dict[$term])) {   // defined() checks constants, so use isset() here
            $dict[$term] = $count++;
          }
          $val = $dict[$term];
          if (!isset($doc_as_int[$val])) {
            $doc_as_int[$val] = 0;
          }
          $doc_as_int[$val]++;
        }
    

    This way, you replace string operations with integer operations. For example, you can represent the word "cloud" as the number 5 and then use index 5 of an array to store the count of the word "cloud". Notice that we only use an associative-array lookup here; there is no need for CRC etc.

  2. Store all texts as a matrix, preferably a sparse one (see the sketch after this list).
  3. Use feature selection (PDF).
  4. Maybe use a native implementation in a faster language.
  5. I suggest you first use K-means with about 20 clusters; this way you get a rough idea of which documents are near one another, and then you compare only the pairs inside each cluster. Assuming uniformly sized clusters (for 200 documents, 20 clusters of 10), this reduces the work to 20*200 + 20*10*9 - around 6,000 comparisons instead of 19,900.
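
To make points 1 and 2 more concrete, here is a minimal sketch of one way this could look in PHP. The function names are just illustrative, and the normalization (score / (length1 * length2) * 500) simply mirrors the function in the question:

// Build a shared dictionary mapping each term to an integer id,
// and represent each document as a sparse vector: id => count.
function encode(array $terms, array &$dict) {
    $vector = array();
    foreach ($terms as $term) {
        if (!isset($dict[$term])) {
            $dict[$term] = count($dict);   // assign the next free integer id
        }
        $id = $dict[$term];
        if (!isset($vector[$id])) {
            $vector[$id] = 0;
        }
        $vector[$id]++;
    }
    return $vector;
}

// Dot product over the sparse vectors: only terms present in BOTH
// documents contribute, so we iterate over one vector and look the
// ids up in the other.
function compare(array $vec1, array $vec2, $length1, $length2) {
    $score = 0;
    foreach ($vec1 as $id => $count) {
        if (isset($vec2[$id])) {
            $score += $count * $vec2[$id];
        }
    }
    return $score / ($length1 * $length2) * 500;
}

$dict = array();
$doc1 = array('the', 'cloud', 'is', 'white');
$doc2 = array('the', 'cloud', 'is', 'grey');
$vec1 = encode($doc1, $dict);
$vec2 = encode($doc2, $dict);
echo compare($vec1, $vec2, count($doc1), count($doc2)); // 93.75

With documents encoded once like this, point 5 then just means calling compare() only for pairs that landed in the same cluster.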
Yuval F
Thank you. Unfortunately, I don't really understand how this should work. Would this really speed up my calculations? In your linked document, I can't find anything useful for my case. There's a lot concerning k-means etc., but I need the relationship values for ALL the documents among each other. So for 200 texts, there would be 200! relationships, right?
It will speed up your calculations. Unless your programming language compares strings faster than it does integers, encoding them as integers will improve performance. Ditto for sparse matrices. If you need to compare 200 texts to each other, the actual number is 200^2, which is much less than 200!. I still believe you could use feature selection to make this task tractable.
Yuval F
Thanks for the additional information. Comparing each article to each other, you won't have 200^2 = 40,000 comparisons but 19,900 - which is still a lot. Assuming that PHP compares integers faster than strings: how do I transform a string into a number? I don't know any hash algorithm that does this. My only idea is using a base-36 alphabet: http://paste.bradleygill.com/index.php?paste_id=9604 But then the numbers get huge very quickly. Is it faster, though? And it would be great if you could explain the comparison process with a sparse matrix in more detail. Sorry, I didn't understand it.
I've tried transforming the strings into integers: it doesn't speed up my calculations. The hashing takes some time, and the very small advantage of integer comparisons (if there is one) doesn't make up for it. I hashed the strings with crc32(). It took more than 4 times longer than with string comparisons.
+2  A: 

You can split the text when adding it. Simple example: preg_match_all('/\w+/', $text, $matches); Of course, real splitting is not quite that simple... but it's possible, just adjust the pattern :)

Create a table with id (int, primary key, auto-increment) and value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting each text.

Finally, you can do whatever you want with this data, operating quickly on indexed integers (IDs) in the database.
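
For example, the filling step could look roughly like this - a sketch only, assuming a PDO connection to MySQL in $pdo and the terms / terms_in_articles tables from the update below. The function name index_article is just illustrative, and error handling is omitted:

// Split a text into words and store the counts for article $articleId.
function index_article(PDO $pdo, $articleId, $text) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    $counts = array_count_values($matches[0]);   // word => number of occurrences

    $insertTerm = $pdo->prepare(
        'INSERT IGNORE INTO terms (value) VALUES (?)');
    $selectTerm = $pdo->prepare(
        'SELECT id FROM terms WHERE value = ?');
    $insertLink = $pdo->prepare(
        'INSERT INTO terms_in_articles (term, article, cnt) VALUES (?, ?, ?)
         ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)');

    foreach ($counts as $word => $cnt) {
        $insertTerm->execute(array($word));       // no-op if the term already exists
        $selectTerm->execute(array($word));
        $termId = $selectTerm->fetchColumn();
        $insertLink->execute(array($termId, $articleId, $cnt));
    }
}

Running something like this once per article means each text is tokenized and counted a single time; after that, everything happens on indexed integers in the database.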

UPDATE: Here are the tables and queries:

CREATE TABLE `terms` (
    `id` int(11) NOT NULL auto_increment,
    `value` char(255) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `value` (`value`)
);

CREATE TABLE `terms_in_articles` (
    `term` int(11) NOT NULL,
    `article` int(11) NOT NULL,
    `cnt` int(11) NOT NULL default '1',
    UNIQUE KEY `term` (`term`, `article`)
);


/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value 
FROM terms t, terms_in_articles a 
WHERE a.term = t.id AND a.article IN (1, 2);

/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt 
FROM terms_in_articles 
WHERE article IN (1, 2) ORDER BY article;

/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total 
FROM terms_in_articles 
WHERE article IN (1, 2) GROUP BY article;

/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM 
(
    SELECT (a1.cnt * a2.cnt) AS term_score 
    FROM terms_in_articles a1, terms_in_articles a2 
    WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
    GROUP BY a2.term, a1.term
) AS tmp;

Well, I hope this helps. The last two queries are enough to perform your task; the other queries are just in case. Of course, you can also compute more statistics, like "the most popular terms", etc...
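To tie it together, here is a rough sketch of how the last two queries might be used from PHP - assuming a PDO connection in $pdo and two article ids in $id1 and $id2; this is only an illustration, not tested code:

// Third query: total term counts per article ($length1, $length2).
$stmt = $pdo->prepare(
    'SELECT article, SUM(cnt) AS total
     FROM terms_in_articles
     WHERE article IN (?, ?) GROUP BY article');
$stmt->execute(array($id1, $id2));
$lengths = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);   // article id => total

// Fourth query: the raw score, then normalize it exactly as check() does.
$stmt = $pdo->prepare(
    'SELECT SUM(tmp.term_score) * 500 AS total_score FROM
     (
         SELECT (a1.cnt * a2.cnt) AS term_score
         FROM terms_in_articles a1, terms_in_articles a2
         WHERE a1.article = ? AND a2.article = ? AND a1.term = a2.term
         GROUP BY a2.term, a1.term
     ) AS tmp');
$stmt->execute(array($id1, $id2));
$rawScore = (float) $stmt->fetchColumn();

$score = $rawScore / ($lengths[$id1] * $lengths[$id2]);
echo $score;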

Jet
Thanks! I know how to split a text into single terms (see the code in my question). The approach with the database sounds interesting, and it would be really easy to do. But what advantage would this have for calculating the relationships? My (unrealistic) wish would be to fill all the word counts into a huge matrix and have an algorithm output an array with all the relationships.
The advantage is that DB engines were built to process large datasets quickly and safely, and with the correct settings I think the DB server will do the calculations on the dataset better than any PHP algorithm. Your idea about a matrix is good for languages where you can manage memory access, but you know - in PHP you can't. Just imagine the size of that matrix in memory. Every script will eat that much memory, and you risk running into performance issues.
Jet
Thanks. It stands to reason that it is faster this way. You've convinced me. But how do I calculate the relations when I've inserted everything into the DB? Can MySQL do this? What query must I use?
Sure. Sorry, I have no time to write and test the SQL right now. I'll update the answer with what the SQL should look like.
Jet
Thanks for the work! I hope the question is still open when you post the SQL code; otherwise I couldn't choose your answer as the best one. :) But does your approach really give the same results as my vector method? I'm asking because I also consider words appearing in only one text, which you don't!?
Updated again. You can use these queries as they implement the same algorithm as your script. Use the third query to get the lengths and then the fourth to get the scores.
Jet
Thank you very much! Still some questions: Is "terms" the dictionary table which contains all unique terms one time? Should "value" contain the tag name? In "terms_in_articles", "cnt" is the number of occurrences, right? Don't you think that int(11) is a bit oversized? Wouldn't tinyint(1) be enough for normal texts? Why do I need the 3rd query? Wouldn't the 4th query alone do everything I want? And, finally: Must I put in the article ids in the where conditions of the 4th query?
This is an example. There are still some performance tunings you may apply (e.g. int(4) instead of int(11), etc.). The third query is needed to get $length1 and $length2 because you use them after the score calculation. And yes, you must put real article ids instead of 1 and 2 in the fourth query. 'value' in terms should contain the word itself (just in case). This structure is a little wider than you need, and as I said before, you can do many more calculations on it.
Jet
And more about the third query - you have to run it separately, because including this calculation in the fourth query would cause serious performance issues (the query gets too complex).
Jet
A: 

If you can compare plain text instead of arrays, and if I understood your goal correctly, you can use the levenshtein PHP function (which is usually used to provide the Google-like 'Did you mean ...?' feature in PHP search engines).

It works the opposite way from your approach: it returns the difference between two strings.

Example:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

But I don't know exactly whether this will improve execution speed. Maybe it will, since you take out many foreach loops and the array_merge call.

EDIT:

A simple speed test (a script written in 30 seconds, so it's not 100% accurate):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

print: end in 0.36765 seconds

Second test:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05023 seconds

So yes, it seems faster. It would be nice to try it with many array items (and many words for levenshtein).

2nd EDIT:

With similar_text() the speed seems to be about equal to the levenshtein method:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05988 seconds

But it can take more than 255 characters. Note from the manual, though:

Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.

and it can even return the similarity value as a percentage:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

Yet another edit

What about creating a database function to do the comparison directly in the SQL query, instead of retrieving all the data and looping over it?

If you're running MySQL, take a look at this one (a hand-made levenshtein function, still with the 255-character limit). Otherwise, if you're on PostgreSQL, look at this other one (many functions worth evaluating).

DaNieL
Thank you very much, I hadn't thought of this function yet. This would really be an alternative. But I don't know which gives better results: the vector-based method (see the code in my question) or the Levenshtein algorithm. Does somebody know?
Just edited with a simple test
DaNieL
Your idea of comparing the execution times is very good! Thanks! I've tried this test again with much more data: 100 documents with 200 words each. And here's the result: My own check() function = 8.56277768 seconds on average. Levenshtein = Error: Argument string(s) too long. So Levenshtein is not adequate for this problem, since the algorithm I'm looking for should also handle long strings.
Sorry, I didn't know that limit; reading the documentation, it is 255 characters. Well, there are two other functions in PHP similar to levenshtein: similar_text() and soundex(), and the first one shouldn't have that limit: http://it2.php.net/manual/en/function.similar-text.php and http://it2.php.net/manual/en/function.soundex.php Maybe tomorrow morning I'll have time to repeat my simple test with those two.
DaNieL
SoundEx() won't work at all I think. I guess similar_text() will work much faster than my complex matrix calculations. But probably the results of similar_text() will be worse.
Why would the similar_text() result be worse? It can even return a percentage... Man, can you post a piece of the code you're using, with two or more elements as an example of what you need to compare?
DaNieL
I just edited again, with some solutions for comparing in the database instead of PHP.
DaNieL
Ok, similar_text() seems to be much faster. But I don't know whether the results can compete with the vector analysis. Bigger sites like Google News or Techmeme use vectors or matrices for clustering, too, I think. They wouldn't use a simple function like similar_text(). To compare the results, I've set up a page: http://www.twem.de/_cluster_methods.php I don't know if it helps if you can't speak German. The results of similar_text() are surprisingly good. But I want to reduce the number of comparisons, not just swap in another function. So a MySQL query doing this would be great. Or matrices! :)
Well, if you mean to handle the same amount of data as Google, then I don't think PHP is the best language, and neither is SQL (not even PostgreSQL). Take a look at Python; it would be quite a bit faster for a lot of string comparisons (though honestly I don't know the Python equivalent of levenshtein or similar_text, there will surely be at least one).
DaNieL
No, not the same amount of data as Google. But the same quality should be my aim! :) Only the same quality as the news clustering in Google News, of course. :D It's clear that I will never achieve the same quality in ranking news and websites and so on. I don't need Python since my data volume isn't very high. But imagine: 2,000 news articles per day - that would already be about 2,000,000 comparisons per day. This would take forever. So I need a method to reduce the number of comparisons, e.g. huge matrices or SQL queries doing this. ;)
Wow! Thanks for something new! I had no idea about "levenshtein". Not sure that this is an answer here (it may be too complex), but it's a very interesting algorithm anyway...
Jet
@marco92w: have you considered some Apache caching system? You do the comparison each morning at 01:00 am, then show the cached result. And if you have to update some records, you re-cache everything manually. It wouldn't be updated in real time this way - would that be a problem for you? @engeniy: yes, it's curious how many people have never heard of the PHP levenshtein function ;)
DaNieL
This is really a good idea. It would work for me. But it would increase the speed only a bit. The problem is the huge number of comparisons to make, not the slowness of the tokenizer which splits the text into single words. So caching wouldn't speed up the algorithm so much.
Well, I don't know how to help you further... Maybe, to speed up the comparisons, you could remove some useless words from the strings (for example articles like 'the', 'on', etc.), or, instead of comparing the texts word by word, you could use a kind of 'tag system', giving every item some tags (as you gave tags to this question) and then comparing those... but that would be manual work, so I don't think it can be your way.
DaNieL
+1  A: 

Here's a slightly optimized version of your original function. It produces the exact same results. (I ran it on two articles from Wikipedia with 10,000+ terms, about 20 runs each:)

check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707

check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624

Here's the code:

function check2($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words

    // Count the occurrences of each term in article 1.
    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    // Count the terms of article 2, but only those that also appear in
    // article 1; terms occurring in just one article would contribute 0 anyway.
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    // Dot product over the shared terms only.
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score / ($length1*$length2);
    $score *= 500;
    return $score;
}

(Btw. The time needed to split all the words into arrays was not included.)
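
Building on this, the count tables can be precomputed once per document and reused for every pairwise comparison (saving them per document, as also suggested in the comments below). A rough sketch, where $documents stands for an assumed id => word-array map:

// Build the term => count table for one document only once.
function build_count_table(array $terms) {
    return array_count_values($terms);
}

// Same scoring as check2(), but working on precomputed count tables.
function score_tables(array $table1, $length1, array $table2, $length2) {
    $score = 0;
    foreach ($table1 as $term => $count) {
        if (isset($table2[$term])) {
            $score += $count * $table2[$term];
        }
    }
    return $score / ($length1 * $length2) * 500;
}

// Precompute once per document...
$tables = array();
$lengths = array();
foreach ($documents as $id => $terms) {   // $documents: id => array of words
    $tables[$id]  = build_count_table($terms);
    $lengths[$id] = count($terms);
}

// ...then every pairwise comparison is just the cheap loop above.
$score = score_tables($tables[1], $lengths[1], $tables[2], $lengths[2]);

This gives the same scores as check2() but avoids re-counting a document's terms for every comparison it takes part in.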

0scar
It isn't slightly improved. It's extremely improved! :) Thanks man, this is exactly what I was looking for. It's four times faster! Why does it work? This is the only thing I still want to know. ;) You use 2 word arrays, I use 1 big word array. Why does it give the same results?
The trick is that if a term only exists in one of the arrays, it contributes 0. This function skips all entries that exist in only one of the arrays, so to speak. Glad you like it :) Cheers! Good luck!
0scar
(Also it skips doing array_merge and array_unique and stuff like that..)
0scar
But Wikipedia says (http://en.wikipedia.org/wiki/Bag_of_words_model_in_computer_vision) that you have to put everything into a big "dictionary" array at first. Why does the algorithm also work if you don't do this? You have other vectors!?
Thanks, nice code, I think I am going to use it; but I have a question: when I applied the function to these two arrays, $array1 = array(1,2,3,4,4); $array2 = array(1,2,3,4,5); echo check2($array1,$array2); it gives me the output "100" - why? The arrays are different!?
ahmed
Hm, well, the 4 is matched twice, so it adds 2 to the score. So you've got 1*1 + 1*1 + 1*1 + 1*2 = 5. Then that times 500, divided by the length, which is 5. That's 100. Is it different from your original version? Btw, the function is fastest if you pass the smallest array as the first argument, I believe, i.e. check2(small_array, larger_array). Also, there are more ways to make this faster. You could save the score_table for each document in a db. You could also hash every term to a number, like others have suggested. In either case, test this function _thoroughly_ before using it! (I haven't...)
0scar
(I meant divided by length1*length2, which is 25.)
0scar
A: 

Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.

The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially giving you an abstract location of your document in a 'semantic space'. This will speed up your text comparison, since you can compare documents using Euclidean distance in the LSA semantic space. It's a pretty fun form of semantic indexing. As a result, adding new articles will not take much longer.

I can't give a specific use case for this approach, having only learned it in school, but it appears that KnowledgeSearch is an open-source implementation of the algorithm.

(Sorry, it's my first post, so I can't post links; just look it up.)

Brohan