I want to cluster texts for a news website.

At the moment I use this algorithm to find the related articles. But I found out that PHP's similar_text() gives very good results, too.

What sort of algorithms do "the big ones" (Google News, Topix, Techmeme, Wikio, Megite, etc.) use? Of course, nobody outside knows exactly how their algorithms work; they're secret. But maybe someone knows approximately how they work?

The algorithm I use at the moment is very slow. It only compares two articles at a time, so finding the relations between 5,000 articles requires about 12,500,000 comparisons. That is quite a lot.

Are there alternatives that reduce the number of necessary comparisons? [I'm not looking for improvements to my algorithm.] What do "the big ones" do? I'm sure they don't compare every article to every other one, 12,500,000 times for 5,000 articles.

It would be great if somebody could say something about this topic.

+10  A: 

I find Apache Solr's related items algorithm to be phenomenal, and as it's open source the algorithms are freely available. Well worth a look. Add in a boost when the dates of the items are similar and you'd be well on your way.

ceejayoz
So if you say that Solr is great, I have to use it! :) Which method should I use then? I've just seen that there is an XML API; that would be perfect for me, but I couldn't find an adequate method. Some part of "Text Analysis"? I've read a lot about the stemmer but nothing about clustering!?
You need the 'mlt' (More Like This) method. http://wiki.apache.org/solr/MoreLikeThis
ceejayoz
Thanks, I'll have a look at it. Do you know whether I can use it from PHP?
Yes. Here's an implementation for Drupal, a PHP CMS: http://drupal.org/project/apachesolr
ceejayoz
I should point out that the MoreLikeThis algo is a) fast, and b) stupid. It's purely keyword based, so can't be as good as, say, latent semantic indexing on a fixed dataset. That being said, it's a classic 80/20 solution - good enough for 80% of uses for 20% of the effort.
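For orientation, a minimal PHP sketch of calling the MoreLikeThis handler over HTTP; the Solr URL and the field names ("title", "body") are placeholder assumptions, not values from this thread:

    <?php
    // Ask Solr's MoreLikeThis handler for documents similar to one article.
    // Assumes a Solr instance at localhost:8983 with the MoreLikeThis handler
    // enabled and articles indexed with "id", "title" and "body" fields;
    // all of these are placeholders, not values from this thread.
    function findRelatedArticles($articleId, $count = 5)
    {
        $params = array(
            'q'         => 'id:' . $articleId,
            'mlt.fl'    => 'title,body',  // fields to mine for "interesting terms"
            'mlt.mintf' => 1,
            'mlt.mindf' => 1,
            'rows'      => $count,
            'wt'        => 'json',
        );
        $url = 'http://localhost:8983/solr/mlt?' . http_build_query($params);

        $response = json_decode(file_get_contents($url), true);
        return isset($response['response']['docs']) ? $response['response']['docs'] : array();
    }

    // Usage: show the titles of the five articles most similar to article 42.
    foreach (findRelatedArticles(42) as $doc) {
        echo $doc['title'], "\n";
    }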
+6  A: 

Variations or combinations of standard clustering algorithms (k-means, hierarchical clustering, and the like) are the usual starting point.

Also, using tags is an easy way to do it. If people tag things for you, you can automatically suggest related articles based on tags (and a clustering over the article titles will probably be useful, too).

A good thing to try is to run CLUTO on the data and test various algorithms.

BTW, there is no need to compare the whole contents of the articles each time, only the feature vectors, which get recalculated only when the document changes, if ever.

Vinko Vrsalovic
Thanks, CLUTO sounds very good. Unfortunately, I can't install new packages, so I can only use APIs or algorithms that I code in PHP myself. You wrote that I only have to compare the feature vectors, so I needn't create them every time, right? I create a vector once and save it for later, don't I? But the document vector depends on the dictionary array, which contains all words of all texts, so I would have to recompute it every single time!? And how can you use k-means for news clustering? Every day I get about 1,500 new articles, and I have no fixed cluster centers, which k-means needs.
If you can't use CLUTO in your app, you can certainly still use it to evaluate different algorithms and choose the one best suited to your app, which you can then implement yourself. You can (and probably should) prepare a fixed dictionary to use; you most certainly don't need every new word in a document to be added to it. You can also store possible new words and monitor them in case there are relevant words to incorporate.
Vinko Vrsalovic
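As a concrete illustration of the caching point discussed above, a rough PHP sketch that rebuilds an article's feature vector only when its text has changed, against a fixed dictionary (the cache path and function name are made up for the example):

    <?php
    // Return the cached feature vector for an article, rebuilding it only when
    // the article text has changed. $dictionary is a fixed, precomputed word list.
    function getFeatureVector($articleId, $text, array $dictionary)
    {
        $cacheFile = "cache/vector_$articleId.ser";
        $hash = md5($text);

        if (is_file($cacheFile)) {
            $cached = unserialize(file_get_contents($cacheFile));
            if ($cached['hash'] === $hash) {
                return $cached['vector'];  // text unchanged, reuse the stored vector
            }
        }

        // Rebuild: one 0/1 entry per dictionary word.
        $present = array_flip(str_word_count(strtolower($text), 1));
        $vector = array();
        foreach ($dictionary as $word) {
            $vector[] = isset($present[$word]) ? 1 : 0;
        }

        file_put_contents($cacheFile, serialize(array('hash' => $hash, 'vector' => $vector)));
        return $vector;
    }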
+8  A: 

The key idea is usually to turn the text into feature vectors that can be easily compared. One (quite simple) solution is to create a dictionary and then build a bit vector for every text where every bit indicates the presence of a specific word from the dictionary.

Dictionary: animal, cat, dog, house, tree, eat, garden, grass, love, flower

Text 1: Our cat loves to eat flowers in the garden in front of our house.
Feature vector: 01010 11011

Text 2: We have two animals - a dog and a cat.
Feature vector: 11100 00000

Text 3: There are many kinds of flowers in our garden and a huge tree.
Feature vector: 00001 01001

Now you can just compare the feature vectors using Hamming distance. In the example the distances are the following.

Text 1: 01010 11011        Text 1: 01010 11011        Text 2: 11100 00000
Text 2: 11100 00000        Text 3: 00001 01001        Text 3: 00001 01001
        10110 11011 => 7           01011 10010 => 5           11101 01001 => 6

A real implementation has to solve some linguistic problems - for example stemming.

Daniel Brückner
Yes, that's exactly what I'm using at the moment. See the link I posted in my question. ;) But the description is very good. If I'd had such a good description, I would have understood it much faster. :) Thanks!
lol ... I should have read the link ... :D
Daniel Brückner
+1 for the explanation
DrDro
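For illustration, a minimal PHP sketch of the feature-vector and Hamming-distance approach described in the answer above (no stemming is done here, so inflected forms like "loves" or "flowers" won't match their dictionary entries until that is added):

    <?php
    // Build a 0/1 feature vector for a text against a fixed dictionary.
    // Note: no stemming is done, so "loves" and "flowers" will not match
    // "love" and "flower"; a real implementation should stem words first.
    function featureVector($text, array $dictionary)
    {
        $present = array_flip(str_word_count(strtolower($text), 1));
        $vector = array();
        foreach ($dictionary as $word) {
            $vector[] = isset($present[$word]) ? 1 : 0;
        }
        return $vector;
    }

    // Hamming distance: the number of positions in which the two vectors differ.
    function hammingDistance(array $a, array $b)
    {
        $distance = 0;
        foreach ($a as $i => $bit) {
            if ($bit !== $b[$i]) {
                $distance++;
            }
        }
        return $distance;
    }

    $dictionary = array('animal', 'cat', 'dog', 'house', 'tree',
                        'eat', 'garden', 'grass', 'love', 'flower');

    $v1 = featureVector('Our cat loves to eat flowers in the garden in front of our house.', $dictionary);
    $v2 = featureVector('We have two animals - a dog and a cat.', $dictionary);

    echo hammingDistance($v1, $v2);  // smaller distance = more similar texts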
+1  A: 

CLUTO is a good tool for clustering, but I have used another algorithm that converges extremely fast and is very efficient. Here is the link to it. You should be able to implement it in any language of your choice. It works on labelled graphs.

If your webhost supports PHP then it must support Python/Perl. I would suggest you use the standard graph packages in Perl and Python that do this sort of computation. You can find a whole lot of Perl packages for clustering, similarity, collaborative filtering, etc. on CPAN.

I have also used NetworkX for Python, which is pretty awesome for basic graph algorithms. Of course, all of this assumes you can represent your content as graphs.

Ritesh M Nayak
"If your webhost supports PHP then it must support python/Perl." - Not necessarily.
Joey Robert
The first algorithm you mentioned ("BiemannTextGraph06.pdf") sounds very interesting. Could you please say something more precise about the algorithm? That would be nice.
+3  A: 

There was an NPR interview with a NYTimes Senior Software Architect where he stated that they used Cosine Similarity to cluster words and phrases in their news search trends:

Ben
Thank you very much, especially for the reference. So I can be sure that "the big ones" use this algorithm. :) It must be good ... :)
Even so, cosine similarity just tells you how they compare feature vectors, not how they generate them. Thar be secret sauce :-)
But even comparing the feature vectors is important, right? ;)
I had to look it up so if anyone else needs to - http://en.wikipedia.org/wiki/Cosine_similarity
marshall
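For completeness, a small PHP sketch of cosine similarity between two term-frequency vectors (plain word => count arrays); this covers only the comparison step, not how the vectors are generated:

    <?php
    // Cosine similarity between two term-frequency vectors given as
    // word => count maps. 1.0 means identical direction, 0.0 means no overlap.
    function cosineSimilarity(array $a, array $b)
    {
        $dot = 0.0;
        $normA = 0.0;
        $normB = 0.0;

        foreach ($a as $term => $count) {
            $normA += $count * $count;
            if (isset($b[$term])) {
                $dot += $count * $b[$term];
            }
        }
        foreach ($b as $count) {
            $normB += $count * $count;
        }

        if ($normA == 0.0 || $normB == 0.0) {
            return 0.0;
        }
        return $dot / (sqrt($normA) * sqrt($normB));
    }

    $v1 = array_count_values(str_word_count(strtolower('the cat sat on the mat'), 1));
    $v2 = array_count_values(str_word_count(strtolower('the cat lay on the mat'), 1));

    echo cosineSimilarity($v1, $v2);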
+9  A: 

Here is an optimization that you can do:

One of our major performance optimizations for the “related questions” query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster.

It's from Stack Overflow podcast #32.

Nick D
Thank you very much, this sounds interesting. I'll test it.
Thanks, sounds like an interesting method to me. Will try :)
Kuroki Kaze
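A quick PHP sketch of the same trick; the tiny stop list below is only a stand-in for a real list such as the 10,000 most common English words mentioned in the quote:

    <?php
    // Strip very common words from a text before indexing or querying it.
    // The short list here is a stand-in for a real stop list (e.g. the top
    // 10,000 most common English words mentioned in the podcast).
    function removeStopWords($text, array $stopWords)
    {
        $stopSet = array_flip($stopWords);
        $kept = array();
        foreach (str_word_count(strtolower($text), 1) as $word) {
            if (!isset($stopSet[$word])) {
                $kept[] = $word;
            }
        }
        return implode(' ', $kept);
    }

    $stopWords = array('the', 'a', 'an', 'and', 'or', 'of', 'in', 'on', 'to', 'is', 'it');
    echo removeStopWords('The cat is in the garden and it loves the flowers', $stopWords);
    // prints: cat garden loves flowers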
A: 

Not exactly an answer but here is some 3rd party software that might help you out: http://www.krollontrack.com/ndmetric/

Ravedave
Hardly related.
dmeister
+3  A: 

The trick isn't in the similarity algorithm, but in identifying the features of each article. In general, the features should include only the most representative words; you can start with words that are much more frequent in the article itself than in a general corpus.

Meidan Alon
Thank you for this answer. Do you think it would be good to use tf-idf and only select the 10 most important words? Is this the formula? $tf_idf = $occurrences_in_this_document/$words_in_this_document * log10($documents_in_database/$documents_in_database_containing_this_word)
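If the tf-idf route is taken, here is a rough PHP sketch of the formula from the comment above, plus a helper that picks the N highest-scoring terms of a document; the precomputed count arrays are assumptions for the example:

    <?php
    // tf-idf for one term in one document, following the formula in the comment:
    // tf  = occurrences of the term in this document / total words in this document
    // idf = log10(total documents / documents containing the term)
    function tfIdf($occurrencesInDoc, $wordsInDoc, $docsInDatabase, $docsContainingTerm)
    {
        if ($occurrencesInDoc == 0 || $docsContainingTerm == 0) {
            return 0.0;
        }
        return ($occurrencesInDoc / $wordsInDoc)
             * log10($docsInDatabase / $docsContainingTerm);
    }

    // Pick the N highest-scoring terms of a document as its features.
    // $termCounts:          word => count for this document
    // $documentFrequencies: word => number of documents containing the word
    function topTerms(array $termCounts, array $documentFrequencies, $docsInDatabase, $n = 10)
    {
        $wordsInDoc = array_sum($termCounts);
        $scores = array();
        foreach ($termCounts as $term => $count) {
            $df = isset($documentFrequencies[$term]) ? $documentFrequencies[$term] : 1;
            $scores[$term] = tfIdf($count, $wordsInDoc, $docsInDatabase, $df);
        }
        arsort($scores);  // highest score first, keys preserved
        return array_slice(array_keys($scores), 0, $n);
    }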
+5  A: 

I sat through a talk by a guy from the Google AdWords / AdSense team recently on just this subject. His comments were specifically about analyzing content similarity to optimize ad choice, but from other things he said I gather they use similar techniques for news, search results, etc.

Specifically, he said that they make heavy use of latent semantic indexing (via singular value decomposition, which is an algorithm you could implement in PHP, though it would be slow for large data sets). The seminal paper on the subject is Deerwester, Dumais, Furnas, Landauer, and Harshman, 1990, with some influential follow-up work by Blei, Ng, and Jordan. The Wikipedia pages (linked above) give some reference implementations of the technique, though I haven't found one in PHP. The singular value decomposition step effectively eliminates words that are so common they contribute little to the semantic content of a document, and is more robust than, e.g., removing the 10,000 most common English words from the corpus.

Dathan
Thank you, this is really interesting. I've already tried to use LSI but couldn't get it working. Most parts were no problem, but I can't accomplish the Singular Value Decomposition. I've spent lots of days on it with no luck. :( It only works with external libraries, and that's not possible for me.
A little self-referential irony for you -- I went looking on Google (which uses LSI) for a PHP implementation of Singular Value Decomposition. Apparently Singular Value Decomposition is strongly semantically correlated with the abbreviation "SVD" -- strongly enough that six of the first ten hits for "Singular Value Decomposition PHP" were for pages that have nothing to do with singular value decomposition, but featured the abbreviation SVD, which wasn't even part of my search! Sometimes even sophisticated algorithms fail. (c:
Dathan
A thought occurs to me -- the SVD operation only needs to be done occasionally, and you don't necessarily have to do it on the server. If you want, you can calculate the LSI dictionary offline and upload it to the server -- then you don't have to find an SVD implementation in PHP, just run the text of each incoming article through a process that picks out the most semantically significant terms in it according to your LSI dictionary. You just reindex the entire corpus offline every so often to keep the system up to date.
Dathan
EDIT: I think "correlation matrix" is a better term than "dictionary" in this case, though I'm not sure what the literature calls it.
Dathan
And by "reindex the entire corpus offline" I mean download every article, or a large number of randomly selected articles, and re-calculate the correlation matrix. The actual re-indexing of each article would probably need to be done on the server, for efficiency and bandwidth reasons.
Dathan
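To make the offline/online split concrete, a rough PHP sketch of the online half: projecting a new article's term counts into the reduced LSI space using a term-by-concept matrix computed offline by the SVD (the matrix layout is an assumption, and the scaling by the inverse singular values is omitted):

    <?php
    // Fold a new article into the reduced LSI space. $termsByConcepts holds the
    // term-by-concept matrix from the offline SVD: $termsByConcepts[$term][$k].
    // Scaling by the inverse singular values is omitted to keep the sketch short.
    function projectIntoLsiSpace(array $termCounts, array $termsByConcepts, $numConcepts)
    {
        $projection = array_fill(0, $numConcepts, 0.0);
        foreach ($termCounts as $term => $count) {
            if (!isset($termsByConcepts[$term])) {
                continue;  // term was not in the offline corpus, skip it
            }
            for ($k = 0; $k < $numConcepts; $k++) {
                $projection[$k] += $count * $termsByConcepts[$term][$k];
            }
        }
        return $projection;  // compare projections with cosine similarity
    }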
+1  A: 

Google uses distributed processing for almost everything (not only very smart algorithms). It has published papers on the proprietary architecture that enables its algorithms to actually scale:

  • Google File System is the distributed file system; the base of it all.
  • MapReduce is a framework that applies functional programming concepts to make distributed programming easier.
  • BigTable is the distributed non-relational database. In spite of being non-relational, Google made it available through JPA, as can be seen on Google App Engine for Java.
  • Chubby is a distributed lock server used for all the server/client failover/election management.
  • Sawzall is a domain-specific language that allows developers to craft common MapReduce applications with a lot less coding. It uses the now open-source Protocol Buffers as a multi-language, blindingly fast serialization mechanism.
Daniel Ribeiro
Thank you, very interesting to know, but it's not what I'm looking for. I'm interested in the algorithms, not in the (necessary) architecture.
+2  A: 

Some others have provided excellent explanations and I had fun reading them.

This is not an algorithmic description, but since you are using PHP, it could provide you with a straight-up working solution. I would strongly recommend going with the Standard Analyzer, which is an analyzer implementation for the Zend port of Lucene from Java. It does the following:

  • Stem words for greater search relevancy
  • Use the pre-existing Zend lowercase filter
  • Filter out a standard set of stop words using the Zend stop words filter

While you will need the Zend library, you can install it independently as well. The author also wrote wpSearch, which is a working implementation and, in my opinion, the best search engine available for the CMS. You could take your cue from the plugin implementation and get it off the ground.

aleemb
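For orientation, a bare-bones sketch of indexing and querying with Zend_Search_Lucene using its default analyzer (not the Standard Analyzer recommended above); the index path and the sample article are placeholders:

    <?php
    require_once 'Zend/Search/Lucene.php';

    // Create a new index (use Zend_Search_Lucene::open() for an existing one).
    $index = Zend_Search_Lucene::create('/path/to/index');

    // Add one article: the title is stored and searchable, the body is
    // indexed but not stored, which keeps the index small.
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', 'Our cat loves flowers'));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
        'Our cat loves to eat flowers in the garden in front of our house.'));
    $index->addDocument($doc);

    // Query the index, e.g. with the most significant terms of a new article.
    $hits = $index->find('cat garden flowers');
    foreach ($hits as $hit) {
        echo $hit->score, ' ', $hit->title, "\n";
    }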