tags:
views: 81
answers: 6

+4  Q: PHP find relevance

Hi,

Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.

A: 

You could use Sphinx to search all the articles for each of the 10 topics, then set a threshold on the number of matches that would make an article count as belonging to a particular topic, and so on.

ovais.tariq
The thing is, I don't know what topic it's going to be. It's dynamic.
Patrick
A: 

I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.

Scott Saunders
A: 

I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?

Extract a list of the least common words and phrases from each article and use those as tags.

Then I would make a list of topics, assign words and phrases that fall within each topic, and match those against the tags. The problem is that you might get more than one topic per article.
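A minimal sketch of that tagging-and-matching step in plain Python (the stopword list, topic names, and vocabularies here are all illustrative, not from any real data set):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on", "as"}

def extract_tags(text, n=5):
    """Return the n most frequent non-stopword terms as tags."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

# Hand-assigned topic vocabularies (made up for the example).
TOPICS = {
    "finance": {"market", "stocks", "bank", "economy"},
    "sports": {"match", "team", "goal", "season"},
}

def match_topics(tags):
    """Return every topic whose vocabulary overlaps the tags --
    note an article can legitimately match more than one topic."""
    return [t for t, vocab in TOPICS.items() if vocab & set(tags)]

tags = extract_tags("The team scored a late goal and won the match this season.")
print(match_topics(tags))  # -> ['sports']
```

As the paragraph above notes, overlap-based matching will happily assign several topics to one article; you'd need a tie-breaking rule (e.g. most overlapping terms) to force a single label.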

Perhaps the best way would be to use some form of Bayesian classifier to determine which topic best describes the article. It will require training the system initially.

This sort of technique is used for determining whether an email is spam.
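For a concrete picture of the Bayesian approach, here is a toy multinomial Naive Bayes classifier written from scratch in Python; the topics and training sentences are invented for illustration, and a real system would need far more training data:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.doc_counts = Counter()              # topic -> number of training docs
        self.vocab = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for topic in self.doc_counts:
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.doc_counts[topic] / total_docs)
            topic_total = sum(self.word_counts[topic].values())
            for w in words:
                score += math.log(
                    (self.word_counts[topic][w] + 1)
                    / (topic_total + len(self.vocab))
                )
            if score > best_score:
                best, best_score = topic, score
        return best

nb = NaiveBayes()
nb.train("stocks fell as the market slid", "finance")
nb.train("the team won the match", "sports")
print(nb.classify("market rally lifts stocks"))  # -> finance
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numeric underflow on long articles; the smoothing keeps unseen words from zeroing out a topic's score.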

This article might be of some help

KSS
+2  A: 

In machine learning/data mining terms, this kind of problem is called classification. The easiest approach is statistically oriented, using past data for future prediction: http://en.wikipedia.org/wiki/Statistical_classification. You can start with the Naive Bayes classifier (commonly used in spam detection).

I would suggest reading this book (although its examples are written in Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325). It has a good example of this.

tszming
+1  A: 

Dirt simple way to create a classifier:

Hand read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.

Create a Lucene/Sphinx index with 10 documents corresponding to each topic. Each document will contain all of the example documents for that topic concatenated together.

To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.

Might not be super-accurate, but it's easy if you don't want to go to the trouble of training a real Naive Bayes classifier. If you do want to go that route, Google for WEKA or MALLET, two good machine learning libraries.
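Without standing up Lucene or Sphinx, the scoring idea above can be approximated in a few lines of Python using cosine similarity between word-count vectors (topic names and example texts are illustrative; a real search engine's relevance score is more sophisticated than this):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# One "document" per topic: all hand-bucketed example articles
# for that topic concatenated together, as the answer describes.
topic_docs = {
    "finance": Counter("stocks market bank shares economy market".split()),
    "sports": Counter("goal team match season league goal".split()),
}

def rank_topics(text):
    """Score a new article against every topic document, highest first --
    analogous to submitting the article as one big OR query."""
    vec = Counter(text.lower().split())
    return sorted(
        ((cosine(vec, tvec), topic) for topic, tvec in topic_docs.items()),
        reverse=True,
    )

print(rank_topics("the market moved as bank shares rose"))
```

The top-ranked topic is the classification; the score gap between first and second place gives a rough confidence measure, mirroring how you would read the Lucene/Sphinx result scores.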

bajafresh4life
+1  A: 

Well, an Apache project providing machine learning libraries is Mahout. Its features include:

[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]

You can find Mahout at http://mahout.apache.org/

Although I have never used Mahout (I've only considered it ;-) ), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)

ftiaronsem
Oh, I totally forgot to mention another alternative. If you want something like Mahout but easier to implement, try out Bobo-Browse: http://sna-projects.com/bobo/ Actually, I believe Bobo-Browse could do an excellent job in your case.
ftiaronsem