I want to cluster texts for a news website.
At the moment I use this algorithm to find the related articles. But I found out that PHP's similar_text() gives very good results, too.
What sort of algorithms do "the big ones", Google News, Topix, Techmeme, Wikio, Megite etc., use? Of course, you don't know exactly how the algorithms work. It's secret. But maybe someone knows approximately the way they work?
The algorithm I use at the moment is very slow. It only compares two articles. So for having the relations between 5,000 articles you need about 12,500,000 comparisons. This is quite a lot.
Are there alternatives to reduce the number of necessary comparisons? [I don't look for improvements for my algorithm.] What do "the big ones" do? I'm sure they don't always compare one article to another and this 12,500,000 times for 5,000 news.
It would be great if somebody can say something about this topic.